Publications from Computer Vision Laboratory
Journal papers
Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network (CNN) and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformer-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient CD framework, the efficient local-global context aggregation network (ELGC-Net), which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an efficient local-global context aggregator (ELGCA) module within the encoder, capturing enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach (ChangeFormer), ELGC-Net achieves a 1.4% gain in the intersection over union (IoU) metric on the LEVIR-CD dataset, while significantly reducing trainable parameters. Our proposed ELGC-Net sets a new state-of-the-art (SOTA) performance on remote sensing CD benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings, while achieving comparable performance.
@article{diva2:1851984,
author = {Noman, Mubashir and Fiaz, Mustansar and Cholakkal, Hisham and Khan, Salman and Khan, Fahad},
title = {{ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection}},
journal = {IEEE Transactions on Geoscience and Remote Sensing},
year = {2024},
volume = {62},
}
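As an editorial illustration of the pooled channel-wise ("transposed") attention idea described in the ELGC-Net abstract above, the following PyTorch sketch pools the spatial dimensions before computing channel-to-channel attention, so the attention cost grows linearly with image resolution. The module name, pooling factor, and layer shapes are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledTransposedAttention(nn.Module):
    def __init__(self, dim, pool=4):
        super().__init__()
        self.pool = nn.AvgPool2d(pool)            # spatial pooling for cheap, robust queries/keys
        self.to_qk = nn.Conv2d(dim, 2 * dim, 1)   # queries and keys from the pooled features
        self.to_v = nn.Conv2d(dim, dim, 1)        # values keep the full resolution
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k = self.to_qk(self.pool(x)).chunk(2, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)               # (b, c, n_pooled)
        k = F.normalize(k.flatten(2), dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)    # (b, c, c): channel-to-channel attention
        out = attn @ self.to_v(x).flatten(2)                 # (b, c, h*w): linear in spatial size
        return self.proj(out.view(b, c, h, w)) + x           # residual connection

A tensor of shape (1, 64, 32, 32) passed through PooledTransposedAttention(64) keeps its shape while mixing information across channels at a cost independent of the full spatial token count.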
During the operation of industrial robots, unusual events may endanger the safety of humans and the quality of production. When collecting data to detect such cases, it is not ensured that data from all potentially occurring errors is included, as unforeseeable events may happen over time. Therefore, anomaly detection (AD) delivers a practical solution, using only normal data to learn to detect unusual events. We introduce a dataset that allows training and benchmarking of anomaly detection methods for robotic applications based on machine data, which will be made publicly available to the research community. As a typical robot task, the dataset includes a pick-and-place application which involves movement, actions of the end effector, and interactions with the objects of the environment. Since several of the contained anomalies are not task-specific but general, evaluations on our dataset are transferable to other robotics applications as well. In addition, we present multivariate time-series flow (MVT-Flow) as a new baseline method for anomaly detection: it relies on deep-learning-based density estimation with normalizing flows, tailored to the data domain by taking its structure into account in the architecture. Our evaluation shows that MVT-Flow outperforms baselines from previous work by a large margin of 6.2% in area under the receiver operating characteristic curve.
@article{diva2:1845839,
author = {Brockmann, Jan Thies and Rudolph, Marco and Rosenhahn, Bodo and Wandt, Bastian},
title = {{The voraus-AD Dataset for Anomaly Detection in Robot Applications}},
journal = {IEEE Transactions on Robotics},
year = {2024},
volume = {40},
pages = {438--451},
}
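For readers unfamiliar with flow-based anomaly scoring, the sketch below shows the generic recipe the MVT-Flow baseline above builds on: a normalizing flow maps an input window to a latent code, and the negative log-likelihood under the flow serves as the anomaly score. The single affine coupling layer and the MLP size are illustrative assumptions; the actual method uses an architecture tailored to the multivariate time-series structure of the robot data.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        # small network predicting scale and shift from the first half of the input (dim assumed even)
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                           # keep scales bounded for numerical stability
        z = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
        log_det = s.sum(dim=-1)                     # log|det Jacobian| of the coupling transform
        return z, log_det

def anomaly_score(flow, x):
    """Negative log-likelihood under the flow; higher values mean more anomalous."""
    z, log_det = flow(x)
    log_pz = (-0.5 * z ** 2 - 0.5 * torch.log(torch.tensor(2 * torch.pi))).sum(dim=-1)
    return -(log_pz + log_det)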
This paper proposes a scribble-based weakly supervised RGB-D salient object detection (SOD) method to relieve the annotation burden from pixel-wise annotations. In view of the ensuing performance drop, we summarize two natural deficiencies of the scribbles and try to alleviate them, namely the weak richness of the pixel training samples (WRPS) and the poor structural integrity of the salient objects (PSIO). WRPS hinders robust saliency perception learning, which can be alleviated via model design for robust feature learning and pseudo-label generation for training sample enrichment. Specifically, we first design a dynamic searching process module as a meta operation to conduct multi-scale and multi-modal feature fusion for the robust RGB-D SOD model construction. Then, a dual-branch consistency learning mechanism is proposed to generate enough pixel training samples for robust saliency perception learning. PSIO makes direct structural learning infeasible since scribbles cannot provide integral structural supervision. Thus, we propose an edge-region structure-refinement loss to recover the structural information and make precise segmentation. We deploy all components and conduct ablation studies on two baselines to validate their effectiveness and generalizability. Experimental results on eight datasets show that our method outperforms other scribble-based SOD models and achieves comparable performance with fully supervised state-of-the-art methods.
@article{diva2:1829437,
author = {Li, Long and Han, Junwei and Liu, Nian and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad},
title = {{Robust Perception and Precise Segmentation for Scribble-Supervised RGB-D Saliency Detection}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2024},
volume = {46},
number = {1},
pages = {479--496},
}
Anomaly detection has recently gained increasing attention in the field of computer vision, likely due to its broad set of applications ranging from product fault detection on industrial production lines and impending event detection in video surveillance to finding lesions in medical scans. Regardless of the domain, anomaly detection is typically framed as a one-class classification task, where the learning is conducted on normal examples only. An entire family of successful anomaly detection methods is based on learning to reconstruct masked normal inputs (e.g. patches, future frames, etc.) and using the magnitude of the reconstruction error as an indicator of the abnormality level. Unlike other reconstruction-based methods, we present a novel self-supervised masked convolutional transformer block (SSMCTB) that incorporates the reconstruction-based functionality at a core architectural level. The proposed self-supervised block is extremely flexible, enabling information masking at any layer of a neural network and being compatible with a wide range of neural architectures. In this work, we extend our previous self-supervised predictive convolutional attentive block (SSPCAB) with a 3D masked convolutional layer, a transformer for channel-wise attention, as well as a novel self-supervised objective based on the Huber loss. Furthermore, we show that our block is applicable to a wider variety of tasks, adding anomaly detection in medical images and thermal videos to the previously considered tasks based on RGB images and surveillance videos. We exhibit the generality and flexibility of SSMCTB by integrating it into multiple state-of-the-art neural models for anomaly detection, bringing forth empirical results that confirm considerable performance improvements on five benchmarks: MVTec AD, BRATS, Avenue, ShanghaiTech, and Thermal Rare Event.
@article{diva2:1825529,
author = {Madan, Neelu and Ristea, Nicolae-Catalin and Ionescu, Radu Tudor and Nasrollahi, Kamal and Khan, Fahad and Moeslund, Thomas B. and Shah, Mubarak},
title = {{Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2024},
volume = {46},
number = {1},
pages = {525--542},
}
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better visual correspondence across frames and object-background discriminability for accurate object segmentation. Distinct from previous self-supervised VOS methods, our approach is based on a learning loss formulation that takes into account both object and background information to ensure object-background discriminability, rather than using only object appearance. The objective function comprises cutout-based reconstruction (a cutout region is a part of a frame whose pixels are replaced with constant values) and tag prediction loss terms. The cutout-based reconstruction term utilizes a simple cutout scheme to learn the pixel-wise correspondence between the current and previous frames in order to reconstruct the original current frame with the added cutout region in it. The introduced cutout patch guides the model to focus on the reappearance of scene parts, thereby implicitly equipping the model to address occlusion-based scenarios. Next, the tag prediction term encourages object-background separability by grouping tags of all pixels in the cutout region that are similar, while separating them from the tags of the rest of the reconstructed pixels. Additionally, we introduce a zoom-in scheme that addresses the problem of small object segmentation by capturing fine structural information at multiple scales. Our proposed approach, termed CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and Youtube-VOS. A detailed ablation showcases the importance of the proposed loss formulation to effectively establish correspondences and object-background discriminability, and the impact of our zoom-in scheme to accurately segment small-sized objects.
@article{diva2:1816606,
author = {Kini, Jyoti and Khan, Fahad and Khan, Salman and Shah, Mubarak},
title = {{CT-VOS: Cutout prediction and tagging for self-supervised video object segmentation}},
journal = {Computer Vision and Image Understanding},
year = {2024},
volume = {238},
}
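The cutout operation referenced in the CT-VOS abstract above is simple enough to sketch directly; the patch size and fill value below are illustrative assumptions.

import torch

def apply_cutout(frame, size=64, fill=0.0):
    """frame: (C, H, W) tensor; returns a copy with one random square region set to a constant."""
    _, h, w = frame.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    out = frame.clone()
    out[:, top:top + size, left:left + size] = fill
    return out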
For most applications, 2D keypoint detection works well and offers a simple and fast tool to analyse human movements. However, there remain many situations where even the best state-of-the-art algorithms reach their limits and fail to detect human keypoints correctly. Such situations may occur especially when individual body parts are occluded, twisted, or when the whole person is flipped. Especially when analysing injuries in alpine ski racing, such twisted and rotated body positions occur frequently. To improve the detection of keypoints for this application, we developed a novel method that refines keypoint estimates by rotating the input videos. We select the best rotation for every frame with a graph-based global solver. Thereby, we improve keypoint detection of an arbitrary pose estimation algorithm, in particular for 'hard' keypoints. In the current proof-of-concept study, we show that our approach outperforms standard keypoint detection in all categories and metrics, by a large margin in injury-related out-of-balance and fall situations, and that it surpasses previous methods in both performance and robustness. The Injury Ski II dataset was made publicly available, aiming to facilitate the investigation of sports accidents based on computer vision in the future.
@article{diva2:1852743,
author = {Zwölfer, Michael and Heinrich, Dieter and Wandt, Bastian and Rhodin, Helge and Spörri, Jörg and Nachbauer, Werner},
title = {{A graph-based approach can improve keypoint detection of complex poses: a proof-of-concept on injury occurrences in alpine ski racing}},
journal = {Scientific Reports},
year = {2023},
volume = {13},
number = {1},
}
Multi-label zero-shot learning strives to classify images into multiple unseen categories for which no data is available during training. In the generalized variant, the test samples can additionally contain seen categories. Existing approaches rely on learning either shared or label-specific attention from the seen classes. Nevertheless, computing reliable attention maps for unseen classes during inference in a multi-label setting is still a challenge. In contrast, state-of-the-art single-label generative adversarial network (GAN) based approaches learn to directly synthesize the class-specific visual features from the corresponding class attribute embeddings. However, synthesizing multi-label features from GANs is still unexplored in the context of the zero-shot setting. When multiple objects occur jointly in a single image, a critical question is how to effectively fuse multi-class information. In this work, we introduce different fusion approaches at the attribute level, feature level and cross level (across attribute and feature levels) for synthesizing multi-label features from their corresponding multi-label class embeddings. To the best of our knowledge, our work is the first to tackle the problem of multi-label feature synthesis in the (generalized) zero-shot setting. Our cross-level fusion-based generative approach outperforms the state-of-the-art on three zero-shot benchmarks: NUS-WIDE, Open Images and MS COCO. Furthermore, we show the generalization capabilities of our fusion approach in the zero-shot detection task on MS COCO, achieving favorable performance against existing methods.
@article{diva2:1840200,
author = {Gupta, Akshita and Narayan, Sanath and Khan, Salman and Khan, Fahad and Shao, Ling and van de Weijer, Joost},
title = {{Generative Multi-Label Zero-Shot Learning}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2023},
volume = {45},
number = {12},
pages = {14611--14624},
}
Multiobject tracking (MOT) is the problem of tracking the state of an unknown and time-varying number of objects using noisy measurements, with important applications such as autonomous driving, tracking animal behavior, and defense systems. The MOT task can be divided into two settings, model-based or model-free, depending on whether accurate and tractable models of the environment are available. Model-based MOT has Bayes-optimal closed-form solutions, which can achieve state-of-the-art (SOTA) performance. However, these methods require approximations in challenging scenarios to remain tractable, which impairs their performance. Deep learning (DL) methods offer a promising alternative, but existing DL models are almost exclusively designed for a model-free setting and are not easily translated to the model-based setting. This article proposes a DL-based tracker specifically tailored to the model-based MOT setting and provides a thorough comparison to SOTA alternatives. We show that our DL-based tracker matches the performance of the benchmarks in simple tracking tasks while outperforming the alternatives as the tasks become more challenging. These findings provide strong evidence of the applicability of DL to the model-based setting as well, which we hope will foster further research in this direction.
@article{diva2:1835633,
author = {Pinto, Juliano and Hess, Georg and Ljungbergh, William and Xia, Yuxuan and Wymeersch, Henk and Svensson, Lennart},
title = {{Deep Learning for Model-Based Multiobject Tracking}},
journal = {IEEE Transactions on Aerospace and Electronic Systems},
year = {2023},
volume = {59},
number = {6},
pages = {7363--7379},
}
Motion prediction systems play a crucial role in enabling autonomous vehicles to navigate safely and efficiently in complex traffic scenarios. Graph Neural Network (GNN)-based approaches have emerged as a promising solution for capturing interactions among dynamic agents and static objects. However, they often lack transparency, interpretability and explainability, qualities that are essential for building trust in autonomous driving systems. In this work, we address this challenge by presenting a comprehensive approach to enhance the explainability of graph-based motion prediction systems. We introduce the Explainable Heterogeneous Graph-based Policy (XHGP) model based on a heterogeneous graph representation of the traffic scene and lane-graph traversals. Distinct from other graph-based models, XHGP leverages object-level and type-level attention mechanisms to learn interaction behaviors, providing information about the importance of agents and interactions in the scene. In addition, capitalizing on XHGP's architecture, we investigate the explanations provided by the GNNExplainer and apply counterfactual reasoning to analyze the sensitivity of the model to modifications of the input data. This includes masking scene elements, altering trajectories, and adding or removing dynamic agents. Our proposal advances towards achieving reliable and explainable motion prediction systems, addressing the concerns of users, developers and regulatory agencies alike. The insights gained from our explainability analysis contribute to a better understanding of the relationships between dynamic and static elements in traffic scenarios, facilitating the interpretation of the results as well as the correction of possible errors in motion prediction models, and thus contributing to the development of trustworthy motion prediction systems. The code to reproduce this work is publicly available at https://github.com/sancarlim/Explainable-MP/tree/v1.1.
@article{diva2:1821017,
author = {Limeros, Sandra Carrasco and Majchrowska, Sylwia and Johnander Fax\'{e}n, Joakim and Petersson, Christoffer and Llorca, David Fernandez},
title = {{Towards explainable motion prediction using heterogeneous graph representations}},
journal = {Transportation Research Part C},
year = {2023},
volume = {157},
}
Transformer models have achieved outstanding results on a variety of language tasks, such as text classification, machine translation, and question answering. This success in the field of Natural Language Processing (NLP) has sparked interest in the computer vision community to apply these models to vision and multi-modal learning tasks. However, visual data has a unique structure, which requires rethinking network designs and training methods. As a result, Transformer models and their variations have been successfully used for image recognition, object detection, segmentation, image super-resolution, video understanding, image generation, text-image synthesis, and visual question answering, among other applications.
@article{diva2:1813849,
author = {Khan, Salman and Khan, Fahad and Vaswani, Ashish and Parmar, Niki and Yang, Ming-Hsuan and Shah, Mubarak},
title = {{Guest Editorial Introduction to the Special Section on Transformer Models in Vision}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2023},
volume = {45},
number = {11},
pages = {12721--12725},
}
The synthesis of high-resolution remote sensing images based on text descriptions has great potential in many practical application scenarios. Although deep neural networks have achieved great success in many important remote sensing tasks, generating realistic remote sensing images from text descriptions is still very difficult. To address this challenge, we propose a novel text-to-image modern Hopfield network (Txt2Img-MHN). The main idea of Txt2Img-MHN is to conduct hierarchical prototype learning on both text and image embeddings with modern Hopfield layers. Instead of directly learning concrete but highly diverse text-image joint feature representations for different semantics, Txt2Img-MHN aims to learn the most representative prototypes from text-image embeddings, achieving a coarse-to-fine learning strategy. These learned prototypes can then be utilized to represent more complex semantics in the text-to-image generation task. To better evaluate the realism and semantic consistency of the generated images, we further conduct zero-shot classification on real remote sensing data using the classification model trained on synthesized images. Despite its simplicity, we find that the overall accuracy in the zero-shot classification may serve as a good metric to evaluate the ability to generate an image from text. Extensive experiments on the benchmark remote sensing text-image dataset demonstrate that the proposed Txt2Img-MHN can generate more realistic remote sensing images than existing methods. Code and pre-trained models are available online (https://github.com/YonghaoXu/Txt2Img-MHN).
@article{diva2:1807439,
author = {Xu, Yonghao and Yu, Weikang and Ghamisi, Pedram and Kopp, Michael and Hochreiter, Sepp},
title = {{Txt2Img-MHN: Remote Sensing Image Generation From Text Using Modern Hopfield Networks}},
journal = {IEEE Transactions on Image Processing},
year = {2023},
volume = {32},
pages = {5737--5750},
}
Given the often enormous effort required to train GANs, both computationally as well as in dataset collection, the re-use of pretrained GANs greatly increases the potential impact of generative models. Therefore, we propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or from multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the subsequent finetuning and avoids pathologies of other methods, such as mode collapse and lack of flexibility. Furthermore, to prevent overfitting on small target domains, we introduce sparse subnetwork selection, which restricts the set of trainable neurons to those that are relevant for the target dataset. We perform comprehensive experiments on several challenging datasets using various GAN architectures (BigGAN, Progressive GAN, and StyleGAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs.
@article{diva2:1800106,
author = {Wang, Yaxing and Gonzalez-Garcia, Abel and Wu, Chenshen and Herranz, Luis and Khan, Fahad and Jui, Shangling and Yang, Jian and van de Weijer, Joost},
title = {{MineGAN++: Mining Generative Models for Efficient Knowledge Transfer to Limited Data Domains}},
journal = {International Journal of Computer Vision},
year = {2023},
volume = {132},
number = {2},
pages = {490--514},
}
Following unprecedented success on natural language tasks, Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results and prompting researchers to reconsider the supremacy of convolutional neural networks (CNNs) as de facto operators. Capitalizing on these advances in computer vision, the medical imaging field has also witnessed growing interest in Transformers that can capture global context, compared to CNNs with local receptive fields. Inspired by this transition, in this survey, we attempt to provide a comprehensive review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues. Specifically, we survey the use of Transformers in medical image segmentation, detection, classification, restoration, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop a taxonomy, identify application-specific challenges as well as provide insights to solve them, and highlight recent trends. Further, we provide a critical discussion of the field's current state as a whole, including the identification of key challenges, open problems, and promising future directions. We hope this survey will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of Transformer models in medical imaging. Finally, to cope with the rapid development in this field, we intend to regularly update the relevant latest papers and their open-source implementations at https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging.
@article{diva2:1791074,
author = {Shamshad, Fahad and Khan, Salman and Zamir, Syed Waqas and Khan, Muhammad Haris and Hayat, Munawar and Khan, Fahad and Fu, Huazhu},
title = {{Transformers in medical imaging: A survey}},
journal = {Medical Image Analysis},
year = {2023},
volume = {88},
}
Predicting the motion of other road agents enables autonomous vehicles to perform safe and efficient path planning. This task is very complex, as the behaviour of road agents depends on many factors and the number of possible future trajectories can be considerable (multi-modal). Most prior approaches proposed to address multi-modal motion prediction are based on complex machine learning systems that have limited interpretability. Moreover, the metrics used in current benchmarks do not evaluate all aspects of the problem, such as the diversity and admissibility of the output. The authors aim to advance towards the design of trustworthy motion prediction systems, based on some of the requirements for the design of Trustworthy Artificial Intelligence. The focus is on evaluation criteria, robustness, and interpretability of outputs. First, the evaluation metrics are comprehensively analysed, the main gaps of current benchmarks are identified, and a new holistic evaluation framework is proposed. Then, a method for the assessment of spatial and temporal robustness is introduced by simulating noise in the perception system. To enhance the interpretability of the outputs and generate more balanced results in the proposed evaluation framework, an intent prediction layer that can be attached to multi-modal motion prediction models is proposed. The effectiveness of this approach is assessed through a survey that explores different elements in the visualisation of the multi-modal trajectories and intentions. The proposed approach and findings make a significant contribution to the development of trustworthy motion prediction systems for autonomous vehicles, advancing the field towards greater safety and reliability.
@article{diva2:1770354,
author = {Limeros, Sandra Carrasco and Majchrowska, Sylwia and Johnander Fax\'{e}n, Joakim and Petersson, Christoffer and Sotelo, Miguel Angel and Llorca, David Fernandez},
title = {{Towards trustworthy multi-modal motion prediction: Holistic evaluation and interpretability of outputs}},
journal = {CAAI Transactions on Intelligence Technology},
year = {2023},
}
Deep Convolutional Neural Networks (CNNs) can easily be fooled by subtle, imperceptible changes to the input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., using a cross-entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries that are in turn used to learn a robust model. Specifically, we use the style and content information of the target sample from another class, alongside its class-boundary information, to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between the source image and its adversary and maximizes the distance between the adversary and the target image. Our adversarial training approach demonstrates strong robustness compared to state-of-the-art defenses, generalizes well to naturally occurring corruptions and data distributional shifts, and retains the model's accuracy on clean examples.
@article{diva2:1758313,
author = {Naseer, Muzammal and Khan, Salman and Hayat, Munawar and Khan, Fahad and Porikli, Fatih},
title = {{Stylized Adversarial Defense}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2023},
volume = {45},
number = {5},
pages = {6403--6414},
}
Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial location and segmentation, or its rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as dominating tracking paradigms, which have led to significant progress. Following the rapid evolution of visual object tracking in the last decade, this survey presents a systematic and thorough review of more than 90 DCF and Siamese trackers, based on results in nine tracking benchmarks. First, we present the background theory of both the DCF and Siamese tracking core formulations. Then, we distinguish and comprehensively review the shared as well as specific open research challenges in both these tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We finish the survey by presenting recommendations and suggestions for distinguished open challenges based on our analysis.
@article{diva2:1758288,
author = {Javed, Sajid and Danelljan, Martin and Khan, Fahad and Khan, Muhammad Haris and Felsberg, Michael and Matas, Jiri},
title = {{Visual Object Tracking With Discriminative Filters and Siamese Networks: A Survey and Outlook}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2023},
volume = {45},
number = {5},
pages = {6552--6574},
}
We propose a novel approach to translate unpaired contrast computed tomography (CT) scans to non-contrast CT scans and vice versa. Solving this task has two important applications: (i) to automatically generate contrast CT scans for patients for whom injecting contrast substance is not an option, and (ii) to enhance the alignment between contrast and non-contrast CT by reducing the differences induced by the contrast substance before registration. Our approach is based on cycle-consistent generative adversarial convolutional transformers, for short, CyTran. Our neural model can be trained on unpaired images, due to the integration of a multi-level cycle-consistency loss. Aside from the standard cycle-consistency loss applied at the image level, we propose to apply additional cycle-consistency losses between intermediate feature representations, which enforces the model to be cycle-consistent at multiple representation levels, leading to superior results. To deal with high-resolution images, we design a hybrid architecture based on convolutional and multi-head attention layers. In addition, we introduce a novel data set, Coltea-Lung-CT-100W, containing 100 3D triphasic lung CT scans (with a total of 37,290 images) collected from 100 female patients (there is one examination per patient). Each scan contains three phases (non-contrast, early portal venous, and late arterial), allowing us to perform experiments to compare our novel approach with state-of-the-art methods for image style transfer. Our empirical results show that CyTran outperforms all competing methods. Moreover, we show that CyTran can be employed as a preliminary step to improve a state-of-the-art medical image alignment method. We release our novel model and data set as open source at https://github.com/ristea/cycletransformer. Our qualitative and subjective human evaluations reveal that CyTran is the only approach that does not introduce visual artifacts during the translation process. We believe this is a key advantage in our application domain, where medical images need to precisely represent the scanned body parts.
@article{diva2:1758145,
author = {Ristea, Nicolae-Catalin and Miron, Andreea-Iuliana and Savencu, Olivian and Georgescu, Mariana-Iuliana and Verga, Nicolae and Khan, Fahad and Ionescu, Radu Tudor},
title = {{CyTran: A cycle-consistent transformer with multi-level consistency for non-contrast to contrast CT translation}},
journal = {Neurocomputing},
year = {2023},
volume = {538},
}
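The multi-level cycle-consistency idea in the CyTran abstract above can be sketched as an image-level L1 cycle term plus additional L1 terms on intermediate feature maps. Which feature maps are compared, the generator interfaces and the weighting below are assumptions made for illustration only.

import torch.nn.functional as F

def multi_level_cycle_loss(g_ab, g_ba, x_a, feat_weight=0.1):
    """g_ab / g_ba translate A->B / B->A and return (output, list of intermediate feature maps)."""
    x_ab, feats_fwd = g_ab(x_a)          # translate to the other domain
    x_aba, feats_bwd = g_ba(x_ab)        # translate back
    loss = F.l1_loss(x_aba, x_a)         # standard image-level cycle consistency
    for f_fwd, f_bwd in zip(feats_fwd, feats_bwd):
        loss = loss + feat_weight * F.l1_loss(f_bwd, f_fwd)   # feature-level cycle consistency
    return loss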
A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in the literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g., based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g., objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.
@article{diva2:1746634,
author = {Barbalau, Antonio and Ionescu, Radu Tudor and Georgescu, Mariana-Iuliana and Dueholm, Jacob and Ramachandra, Bharathkumar and Nasrollahi, Kamal and Khan, Fahad and Moeslund, Thomas B. and Shah, Mubarak},
title = {{SSMTL++: Revisiting self-supervised multi-task learning for video anomaly detection}},
journal = {Computer Vision and Image Understanding},
year = {2023},
volume = {229},
}
Video instance segmentation is one of the core problems in computer vision. Formulating a purely learning-based method, which models the generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning framework where the entire video instance segmentation problem is modeled jointly. To this end, we design a graph neural network that in each frame jointly processes all detections and a memory of previously seen tracks. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach operates online at over 25 FPS and obtains 16.3 AP on the challenging OVIS benchmark, setting a new state-of-the-art. We further conduct detailed ablative experiments that validate the different aspects of our approach. Code is available at https://github.com/emibr948/RGNNVIS-PlusPlus.
@article{diva2:1714333,
author = {Brissman, Emil and Johnander, Joakim and Danelljan, Martin and Felsberg, Michael},
title = {{Recurrent Graph Neural Networks for Video Instance Segmentation}},
journal = {International Journal of Computer Vision},
year = {2023},
volume = {131},
pages = {471--495},
}
We propose a fast single-stage method for both image and video instance segmentation, called SipMask, that preserves the instance spatial information by performing multiple sub-region mask predictions. The main module in our method is a light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for the sub-regions within a bounding box, enabling a better delineation of spatially adjacent instances. To better correlate mask prediction with object detection, we further propose a mask alignment weighting loss and a feature alignment scheme. In addition, we identify two issues that impede the performance of single-stage instance segmentation and introduce two modules, a sample selection scheme and an instance refinement module, to address them. Experiments are performed on both the image instance segmentation dataset MS COCO and the video instance segmentation dataset YouTube-VIS. On the MS COCO test-dev set, our method achieves state-of-the-art performance. In terms of real-time capabilities, it outperforms YOLACT by a gain of 3.0% (mask AP) under similar settings, while operating at a comparable speed. On the YouTube-VIS validation set, our method also achieves promising results. The source code is available at https://github.com/JialeCao001/SipMask.
@article{diva2:1679024,
author = {Cao, Jiale and Pang, Yanwei and Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Fahad Shahbaz and Shao, Ling},
title = {{SipMaskv2: Enhanced Fast Image and Video Instance Segmentation}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2023},
volume = {45},
number = {3},
pages = {3798--3812},
}
The introduction of large training datasets was essential for the recent advancement and success of deep learning methods. Due to the difficulties related to biometric data collection, facial image datasets with biometric trait labels are scarce and usually limited in terms of size and sample diversity. Web-scraping approaches for automatic data collection can produce large amounts of weakly labeled and noisy data. This work is focused on picking out the bad apples from web-scraped facial datasets by automatically removing erroneous samples that impair their usability. The unsupervised facial biometric data filtering method presented in this work greatly reduces label noise levels in web-scraped facial biometric data. Experiments on two large state-of-the-art web-scraped datasets demonstrate the effectiveness of the proposed method with respect to real and apparent age estimation, based on five different age estimation methods. Furthermore, we apply the proposed method, together with a newly devised strategy for merging multiple datasets, to data collected from three major web-based data sources (i.e., IMDb, Wikipedia, Google) and derive the new Biometrically Filtered Famous Figure Dataset, or B3FD. The proposed dataset, which is made publicly available, enables considerable performance gains for all tested age estimation methods and age estimation tasks. This work highlights the importance of training data quality compared to data quantity and the selection of the estimation method.
@article{diva2:1634548,
author = {Be\v{s}eni\'{c}, Kre\v{s}imir and Ahlberg, Jörgen and Pandži\'{c}, Igor S.},
title = {{Picking out the bad apples: unsupervised biometric data filtering for refined age estimation}},
journal = {The Visual Computer},
year = {2023},
volume = {39},
pages = {219--237},
}
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks such as long short-term memory (LSTM) networks. Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and three-dimensional analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges toward the application of transformer models in computer vision.
@article{diva2:1716688,
author = {Khan, Salman and Naseer, Muzammal and Hayat, Munawar and Zamir, Syed Waqas and Khan, Fahad and Shah, Mubarak},
title = {{Transformers in Vision: A Survey}},
journal = {ACM Computing Surveys},
year = {2022},
volume = {54},
number = {10},
}
In recent years, Siamese network based trackers have significantly advanced the state-of-the-art in real-time tracking. Despite their success, Siamese trackers tend to suffer from high memory costs, which restrict their applicability to mobile devices with tight memory budgets. To address this issue, we propose a distilled Siamese tracking framework to learn small, fast and accurate trackers (students), which capture critical knowledge from large Siamese trackers (teachers) via a teacher-students knowledge distillation model. This model is intuitively inspired by the one-teacher-versus-multiple-students learning method typically employed in schools. In particular, our model contains a single teacher-student distillation module and a student-student knowledge sharing mechanism. The former is designed using a tracking-specific distillation strategy to transfer knowledge from a teacher to students. The latter is utilized for mutual learning between students to enable in-depth knowledge understanding. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, the results on five tracking benchmarks show that the proposed distilled trackers achieve compression rates of up to 18x and frame-rates of 265 FPS, while obtaining tracking accuracy comparable to the base models.
@article{diva2:1714279,
author = {Shen, Jianbing and Liu, Yuanpei and Dong, Xingping and Lu, Xiankai and Khan, Fahad and Hoi, Steven},
title = {{Distilled Siamese Networks for Visual Tracking}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2022},
volume = {44},
number = {12},
pages = {8896--8909},
}
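As background for the teacher-student transfer described above, a generic response-map distillation loss can be written as a temperature-softened KL divergence between teacher and student scores. The temperature and the flattening of the score maps are illustrative assumptions, not the paper's tracking-specific distillation strategy.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student score maps."""
    t = temperature
    p_teacher = F.softmax(teacher_logits.flatten(1) / t, dim=1)
    log_p_student = F.log_softmax(student_logits.flatten(1) / t, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)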
In a real-world setting, object instances from new classes can be continuously encountered by object detectors. When existing object detectors are applied to such scenarios, their performance on old classes deteriorates significantly. A few efforts have been reported to address this limitation, all of which apply variants of knowledge distillation to avoid catastrophic forgetting. We note that although distillation helps to retain previous learning, it obstructs fast adaptability to new tasks, which is a critical requirement for incremental learning. In this pursuit, we propose a meta-learning approach that learns to reshape model gradients, such that information across incremental tasks is optimally shared. This ensures a seamless information transfer via a meta-learned gradient preconditioning that minimizes forgetting and maximizes knowledge transfer. In comparison to existing meta-learning methods, our approach is task-agnostic, allows incremental addition of new classes, and scales to high-capacity models for object detection. We evaluate our approach on a variety of incremental learning settings defined on the PASCAL-VOC and MS COCO datasets, where it performs favourably against state-of-the-art methods. Code and trained models: https://github.com/JosephKJ/iOD.
@article{diva2:1714189,
author = {Joseph, K. J. and Rajasegaran, Jathushan and Khan, Salman and Khan, Fahad and Balasubramanian, Vineeth N.},
title = {{Incremental Object Detection via Meta-Learning}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2022},
volume = {44},
number = {12},
pages = {9209--9216},
}
Adversarial training (AT) is an effective approach to making deep neural networks robust against adversarial attacks. Recently, different AT defenses have been proposed that not only maintain a high clean accuracy but also show significant robustness against popular and well-studied adversarial attacks, such as projected gradient descent (PGD). High adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as "gradient masking." In this work, we analyze the effect of label smoothing on AT as one of the potential causes of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed guided projected gradient attack (G-PGA). Our attack approach is based on a "match and deceive" loss that finds optimal adversarial directions through guidance from a surrogate model. Our modified attack does not require random restarts, a large number of attack iterations, or a search for the optimal step size. Furthermore, our proposed G-PGA is generic, so it can be combined with an ensemble attack strategy, as we demonstrate in the case of auto-attack, leading to efficiency and convergence speed improvements. More than an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
@article{diva2:1689187,
author = {Naseer, Muzammal and Khan, Salman and Porikli, Fatih and Khan, Fahad},
title = {{Guidance Through Surrogate: Toward a Generic Diagnostic Attack}},
journal = {IEEE Transactions on Neural Networks and Learning Systems},
year = {2022},
}
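For context, the plain PGD attack that G-PGA extends is sketched below; the guided "match and deceive" loss from the paper is not reproduced here, and the step size, epsilon and iteration count are illustrative defaults.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Plain L-infinity PGD with a cross-entropy objective and a random start."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()     # ascend the classification loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)         # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                        # stay in the valid image range
    return x_adv.detach()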
Pedestrian detection is an important but challenging problem in computer vision, especially in human-centric tasks. Over the past decade, significant improvement has been witnessed with the help of handcrafted features and deep features. Here we present a comprehensive survey on recent advances in pedestrian detection. First, we provide a detailed review of single-spectral pedestrian detection that includes handcrafted features based methods and deep features based approaches. For handcrafted features based methods, we present an extensive review of approaches and find that handcrafted features with large degrees of freedom in shape and space have better performance. In the case of deep features based approaches, we split them into pure CNN based methods and those employing both handcrafted and CNN based features. We give the statistical analysis and tendency of these methods, where feature enhanced, part-aware, and post-processing methods have attracted the most attention. In addition to single-spectral pedestrian detection, we also review multi-spectral pedestrian detection, which provides more robust features under illumination variance. Furthermore, we introduce related datasets and evaluation metrics, and provide an in-depth experimental analysis. We conclude this survey by emphasizing open problems that need to be addressed and highlighting various future directions. Researchers can track an up-to-date list at https://github.com/JialeCao001/PedSurvey.
@article{diva2:1600804,
author = {Cao, Jiale and Pang, Yanwei and Xie, Jin and Khan, Fahad Shahbaz and Shao, Ling},
title = {{From Handcrafted to Deep Features for Pedestrian Detection: A Survey}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2022},
volume = {44},
number = {9},
pages = {4913--4934},
}
Dual-energy computed tomography (CT) can be used in radiotherapy treatment planning for the calculation of absorbed dose distributions. The aim of this work is to evaluate whether there is room for improvement in the accuracy of the Monoenergetic Plus algorithm by Siemens Healthineers. A Siemens SOMATOM Force scanner was used to scan a cylindrical polymethyl methacrylate phantom with four rod inserts made of different materials. Images were reconstructed using ADMIRE and processed with Monoenergetic Plus. The resulting CT numbers were compared with tabulated values and with values simulated by the proof-of-concept algorithm DIRA developed by the authors. Both the Monoenergetic Plus and DIRA algorithms performed well; the accuracy of attenuation coefficients was better than about ±1% at the energy of 70 keV. The worse performance of Monoenergetic Plus compared with DIRA was caused by its (i) two-material decomposition into iodine and water and (ii) imperfect suppression of the beam-hardening artifact in ADMIRE.
@article{diva2:1604198,
author = {Magnusson, Maria and Sandborg, Michael and Alm Carlsson, Gudrun and Henriksson, Lilian and Carlsson Tedgren, Åsa and Malusek, Alexandr},
title = {{Accuracy of CT Numbers Obtained by DIRA and Monoenergetic Plus Algorithms in Dual-Energy Computed Tomography}},
journal = {Radiation Protection Dosimetry},
year = {2021},
volume = {195},
number = {3-4},
pages = {212--217},
}
Automatic segmentation of bones in computed tomography (CT) images is used, for instance, in beam-hardening correction algorithms, where it improves the accuracy of resulting CT numbers. Of special interest are pelvic bones, which, because of their strong attenuation, affect the accuracy of brachytherapy in this region. This work compared the performance of the JJ2016 algorithm with that of the MK2014v2 and JS2018 algorithms; all these algorithms were developed by the authors. Visual comparison and, in the latter case, also Dice similarity coefficients derived from the ground truth were used. It was found that the 3D-based JJ2016 performed better than the 2D-based MK2014v2, mainly because of the more accurate hole filling that benefitted from information in adjacent slices. The neural-network-based JS2018 outperformed both traditional algorithms. It was, however, limited to a resolution of 128³ owing to the limited amount of memory in the graphical processing unit (GPU).
@article{diva2:1602153,
author = {Jeuthe, Julius and Sánchez, Jos\'{e} Carlos González and Magnusson, Maria and Sandborg, Michael and Carlsson Tedgren, Åsa and Malusek, Alexandr},
title = {{Semi-Automated 3D Segmentation of Pelvic Region Bones in CT Volumes for the Annotation of Machine Learning Datasets}},
journal = {Radiation Protection Dosimetry},
year = {2021},
volume = {195},
number = {3-4},
pages = {172--176},
}
The choice of the material basis used for material decomposition in dual-energy computed tomography may affect the quality of reconstructed images. The aim of this work is to investigate how the commonly used bases (water, bone), (water, iodine) and (photoelectric effect, Compton scattering) affect the reconstructed linear attenuation coefficient in the case of the Alvarez–Macovski method. The performance of this method is also compared with the performance of the Dual-energy Iterative Reconstruction Algorithm (DIRA). In both cases, the study is performed using simulations. The results show that the Alvarez–Macovski method produced artefacts when iodine was present in the phantom together with human tissues, since this method can only work with one doublet. It was shown that these artefacts could be avoided with DIRA by using the (water, bone) doublet for tissues and the (water, iodine) doublet for the iodine solution.
@article{diva2:1602016,
author = {Magnusson, Maria and Alm Carlsson, Gudrun and Sandborg, Michael and Carlsson Tedgren, Åsa and Malusek, Alexandr},
title = {{Optimal Selection of Base Materials for Accurate Dual-Energy Computed Tomography: Comparison Between the Alvarez--Macovski Method and DIRA}},
journal = {Radiation Protection Dosimetry},
year = {2021},
volume = {195},
number = {3-4},
pages = {218--224},
}
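For reference, both algorithms rest on a two-material (doublet) decomposition of the linear attenuation coefficient; in the Alvarez-Macovski case the basis functions are the photoelectric and Compton (Klein-Nishina) energy dependences. The notation below is generic, not the papers' exact symbols.

% two-material (doublet) decomposition with basis materials 1 and 2
\mu(E) \approx a_1\,\mu_1(E) + a_2\,\mu_2(E)
% Alvarez--Macovski parameterization: photoelectric (approximately E^{-3}) and Compton (Klein--Nishina) bases
\mu(E) \approx a_{\mathrm{PE}}\,E^{-3} + a_{\mathrm{C}}\,f_{\mathrm{KN}}(E)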
We present a novel learning-based approach to graph representations of road networks employing state-of-the-art graph convolutional neural networks. Our approach is applied to realistic road networks of 17 cities from OpenStreetMap. While edge features are crucial to generate descriptive graph representations of road networks, graph convolutional networks usually rely on node features only. We show that the highly representative edge features can still be integrated into such networks by applying a line graph transformation. We also propose a method for neighborhood sampling based on a topological neighborhood composed of both local and global neighbors. We compare the performance of learning representations using different types of neighborhood aggregation functions in transductive and inductive tasks and in supervised and unsupervised learning. Furthermore, we propose a novel aggregation approach, the Graph Attention Isomorphism Network (GAIN). Our results show that GAIN outperforms state-of-the-art methods on the road type classification problem.
@article{diva2:1581263,
author = {Gharaee, Zahra and Kowshik, Shreyas and Stromann, Oliver and Felsberg, Michael},
title = {{Graph representation learning for road type classification}},
journal = {Pattern Recognition},
year = {2021},
volume = {120},
}
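The line graph transformation mentioned in the abstract above turns each edge of the road graph into a node, so edge attributes (road type, length) can be consumed by a GNN that only reads node features. The toy graph and attribute names below are illustrative; NetworkX is used only to show the transformation.

import networkx as nx

road = nx.Graph()
road.add_edge("junction_a", "junction_b", road_type="residential", length=120.0)
road.add_edge("junction_b", "junction_c", road_type="primary", length=450.0)

line = nx.line_graph(road)                 # each road segment (edge) becomes a node
for u, v in line.nodes:                    # copy the edge attributes onto the line-graph nodes
    line.nodes[(u, v)].update(road.edges[u, v])

print(line.nodes(data=True))               # line-graph nodes now carry road_type and length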
Nowadays, airline ticket prices can vary dynamically and significantly for the same flight, even for nearby seats within the same cabin. Customers are seeking to get the lowest price while airlines are trying to keep their overall revenue as high as possible and maximize their profit. Airlines use various kinds of computational techniques to increase their revenue, such as demand prediction and price discrimination. From the customer side, two kinds of models have been proposed by different researchers to save money for customers: models that predict the optimal time to buy a ticket and models that predict the minimum ticket price. In this paper, we present a review of customer-side and airline-side prediction models. Our review analysis shows that models on both sides rely on a limited set of features such as historical ticket price data, ticket purchase date and departure date. Features extracted from external factors such as social media data and search engine queries are not considered. Therefore, we introduce and discuss the concept of using social media data for ticket/demand prediction.
@article{diva2:1575609,
author = {Abdella, Juhar Ahmed and Zaki, N. M. and Shuaib, Khaled and Khan, Fahad},
title = {{Airline ticket price and demand prediction: A survey}},
journal = {Journal of King Saud University - Computer and Information Sciences},
year = {2021},
volume = {33},
number = {4},
pages = {375--391},
}
Face alignment is the process of determining a face shape given its location and size in an image. It is used as a basis for other facial analysis tasks and for human-machine interaction and augmented reality applications. It is a challenging problem due to the extremely high variability in facial appearance affected by many external (illumination, occlusion, head pose) and internal factors (race, facial expression). However, advances in deep learning combined with domain-related knowledge from previous research have recently demonstrated impressive results, nearly saturating the unconstrained benchmark data sets. The focus is shifting towards reducing the computational burden of face alignment models, since real-time performance is required for such a highly dynamic task. Furthermore, many applications target devices on the edge with limited computational power, which puts even greater emphasis on computational efficiency. We present the latest developments in regression-based approaches that have led towards nearly solving the face alignment problem in unconstrained scenarios. Various regression architectures are systematically explored and recent training techniques are discussed in the context of face alignment. Finally, a benchmark comparison of the most successful methods is presented, taking execution time into account as well, to provide a comprehensive overview of this dynamic research field.
@article{diva2:1501023,
author = {Gogic, Ivan and Ahlberg, Jörgen and Pandzic, Igor S.},
title = {{Regression-based methods for face alignment: A survey}},
journal = {Signal Processing},
year = {2021},
volume = {178},
}
Automatic recognition of an online series of unsegmented actions requires a method for segmentation that determines when an action starts and when it ends. In this paper, a novel approach for recognizing unsegmented actions in online test experiments is proposed. The method uses self-organizing neural networks to build a three-layer cognitive architecture. The unique features of an action sequence are represented as a series of elicited key activations by the first-layer self-organizing map. An average length of a key activation vector is calculated for all action sequences in a training set and adjusted in learning trials to generate input patterns to the second-layer self-organizing map. The pattern vectors are clustered in the second layer, and the clusters are then labeled by an action identity in the third-layer neural network. The experimental results show that although performance drops slightly in online experiments compared to the offline tests, the ability of the proposed architecture to deal with unsegmented action sequences, together with its online performance, makes the system more plausible and practical in real-world scenarios.
@article{diva2:1455370,
author = {Gharaee, Zahra},
title = {{Online recognition of unsegmented actions with hierarchical SOM architecture}},
journal = {Cognitive Processing},
year = {2021},
volume = {22},
pages = {77--91},
}
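The architecture above is built from self-organizing maps; as a minimal, generic illustration of the underlying SOM update (best-matching unit search followed by a neighborhood-weighted prototype update), here is a hedged numpy sketch. It is not the paper's three-layer architecture, and map size, learning rate and neighborhood width are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
map_h, map_w, dim = 10, 10, 3             # 10x10 map of 3-D prototypes
weights = rng.random((map_h, map_w, dim))
grid_y, grid_x = np.mgrid[0:map_h, 0:map_w]

def som_step(x, weights, lr=0.1, sigma=2.0):
    # Best-matching unit: the prototype closest to the input sample.
    dists = np.linalg.norm(weights - x, axis=2)
    by, bx = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighborhood around the BMU on the map grid.
    h = np.exp(-((grid_y - by) ** 2 + (grid_x - bx) ** 2) / (2 * sigma ** 2))
    # Move prototypes toward the sample, weighted by the neighborhood.
    weights += lr * h[..., None] * (x - weights)
    return by, bx

for _ in range(1000):                      # toy training loop on random data
    som_step(rng.random(dim), weights)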
In this paper, a novel cognitive architecture for action recognition is developed by applying layers of growing grid neural networks. Using these layers makes the system capable of automatically arranging its representational structure. In addition to the expansion of the neural map during the growth phase, the system is provided with prior knowledge of the input space, which increases the processing speed of the learning phase. Apart from the two layers of growing grid networks, the architecture is composed of a preprocessing layer, an ordered vector representation layer and a one-layer supervised neural network. These layers are designed to solve the action recognition problem. The first-layer growing grid receives the input data of human actions and the neural map generates an action pattern vector representing each action sequence by connecting the elicited activation of the trained map. The pattern vectors are then sent to the ordered vector representation layer to build the time-invariant input vectors of key activations for the second-layer growing grid. The second-layer growing grid categorizes the input vectors into the corresponding action clusters/sub-clusters and finally the one-layer supervised neural network labels the shaped clusters with action labels. Three experiments using different datasets of actions show that the system is capable of learning to categorize the actions quickly and efficiently. The performance of the growing grid architecture is compared with the results from a system based on Self-Organizing Maps, showing that the growing grid architecture performs significantly better on the action recognition tasks.
@article{diva2:1437059,
author = {Gharaee, Zahra},
title = {{Hierarchical growing grid networks for skeleton based action recognition}},
journal = {Cognitive Systems Research},
year = {2020},
volume = {63},
pages = {11--29},
}
In many computer vision applications, one acquires images of planar surfaces from two different vantage points. One can use a projective transformation to map pixel coordinates associated with a particular planar surface from one image to another. The transformation, called a homography, can be represented by a 3 × 3 matrix that is unique up to a scale factor. One requires a different homography matrix, scale differences apart, for each planar surface whose two images one wants to relate. However, a collection of homography matrices forms a valid set only if the matrices satisfy consistency constraints implied by the rigidity of the motion and the scene. We explore what it means for a set of homography matrices to be compatible and show that two seemingly disparate definitions are in fact equivalent. Our insight lays the theoretical foundations upon which the derivation of various sets of homography consistency constraints can proceed.
@article{diva2:1436062,
author = {Chojnacki, Wojciech and Szpak, Zygmunt L. and Wadenbäck, Mårten},
title = {{The equivalence of two definitions of compatible homography matrices}},
journal = {Pattern Recognition Letters},
year = {2020},
volume = {135},
pages = {38--43},
}
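For readers unfamiliar with the constraint being analysed, one standard way to express compatibility (stated here as a hedged reminder of textbook material in calibrated coordinates, not as the paper's exact definitions) is that all homographies must be induced by a single rigid motion:

\[
  \mathbf{H}_i \simeq \mathbf{R} + \frac{1}{d_i}\,\mathbf{t}\,\mathbf{n}_i^{\top}, \qquad i = 1,\dots,N,
\]

where \(\mathbf{R}\) and \(\mathbf{t}\) are the rotation and translation shared by all planes, \(\mathbf{n}_i\) is the unit normal of the i-th plane, \(d_i\) its distance to the first camera, and \(\simeq\) denotes equality up to scale. A set of matrices is compatible precisely when such a common \((\mathbf{R}, \mathbf{t})\) exists.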
This letter introduces a framework for evaluation of the losses used in point set registration. In order for a loss to be useful with a local optimizer, such as Levenberg-Marquardt or expectation maximization (EM), it must be monotonic with respect to the sought transformation. This motivates us to introduce monotonicity violation probability (MVP) curves, and to use these to assess monotonicity empirically for many different local distances, such as point-to-point, point-to-plane, and plane-to-plane. We also introduce a local shape-to-shape distance, based on the Wasserstein distance of the local normal distributions. Evaluation is done on a comprehensive benchmark of terrestrial lidar scans from two publicly available datasets. It demonstrates that matching robustness can be improved significantly by using kernel versions of local distances together with inverse-density-based sample weighting.
@article{diva2:1424738,
author = {Tavares, Anderson and Järemo-Lawin, Felix and Forss\'{e}n, Per-Erik},
title = {{Assessing Losses for Point Set Registration}},
journal = {IEEE Robotics and Automation Letters},
year = {2020},
volume = {5},
number = {2},
pages = {3360--3367},
}
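The shape-to-shape distance above builds on the Wasserstein distance between local normal distributions. As a hedged sketch, the closed-form 2-Wasserstein distance between two Gaussians can be computed as follows in Python; this is the textbook formula, not necessarily the kernelized or weighted variant used in the paper.

import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_squared(mu1, cov1, mu2, cov2):
    # Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2).
    s2 = sqrtm(cov2)
    cross = sqrtm(s2 @ cov1 @ s2)
    # Bures term; take the real part to discard numerical imaginary noise.
    bures = np.trace(cov1 + cov2 - 2 * np.real(cross))
    return float(np.sum((mu1 - mu2) ** 2) + bures)

mu1, cov1 = np.zeros(3), np.eye(3)
mu2, cov2 = np.ones(3), 2 * np.eye(3)
print(gaussian_w2_squared(mu1, cov1, mu2, cov2))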
Facial expression recognition applications demand accurate and fast algorithms that can run in real time on platforms with limited computational resources. We propose an algorithm that bridges the gap between precise but slow methods and fast but less precise methods. The algorithm combines gentle boost decision trees and neural networks. The gentle boost decision trees are trained to extract highly discriminative feature vectors (local binary features) for each basic facial expression around distinct facial landmark points. These sparse binary features are concatenated and used to jointly optimize facial expression recognition through a shallow neural network architecture. The joint optimization improves the recognition rates of difficult expressions such as fear and sadness. Furthermore, extensive experiments in both within- and cross-database scenarios have been conducted on relevant benchmark data sets for facial expression recognition: CK+, MMI, JAFFE, and SFEW 2.0. The proposed method (LBF-NN) compares favorably with state-of-the-art algorithms while achieving an order of magnitude improvement in execution time.
@article{diva2:1413990,
author = {Gogic, Ivan and Manhart, Martina and Pandzic, Igor S. and Ahlberg, Jörgen},
title = {{Fast facial expression recognition using local binary features and shallow neural networks}},
journal = {The Visual Computer},
year = {2020},
volume = {36},
number = {1},
pages = {97--112},
}
Deep learning algorithms have improved the speed and quality of segmentation for certain tasks in medical imaging. The aim of this work is to design and evaluate an algorithm capable of segmenting bones in dual-energy CT data sets. A convolutional neural network based on the 3D U-Net architecture was implemented and evaluated using high tube voltage images, mixed images and dual-energy images from 30 patients. The network performed well on all the data sets; the mean Dice coefficient for the test data was larger than 0.963. Of special interest is that it performed better on dual-energy CT volumes compared to mixed images that mimicked images taken at 120 kV. The corresponding increase in the Dice coefficient from 0.965 to 0.966 was small since the enhancements were mainly at the edges of the bones. The method can easily be extended to the segmentation of multi-energy CT data.
@article{diva2:1391067,
author = {Sanchez, Jose Carlos Gonzalez and Magnusson, Maria and Sandborg, Michael and Carlsson Tedgren, Åsa and Malusek, Alexandr},
title = {{Segmentation of bones in medical dual-energy computed tomography volumes using the 3D U-Net}},
journal = {Physica medica (Testo stampato)},
year = {2020},
volume = {69},
pages = {241--247},
}
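For reference, the Dice coefficient quoted above measures the overlap between a predicted binary mask and the ground truth. A minimal numpy version (illustrative only; the smoothing constant is arbitrary):

import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice = 2|A intersect B| / (|A| + |B|) for binary masks A and B.
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool); gt[12:42, 12:42] = True
print(dice_coefficient(pred, gt))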
This paper investigates the problem of position estimation of unmanned surface vessels (USVs) operating in coastal areas or in the archipelago. We propose a position estimation method where the horizon line is extracted in a 360 degrees panoramic image around the USV. We design a convolutional neural network (CNN) architecture to determine an approximate horizon line in the image and implicitly determine the camera orientation (the pitch and roll angles). The panoramic image is warped to compensate for the camera orientation and to generate an image from an approximately level camera. A second CNN architecture is designed to extract the pixelwise horizon line in the warped image. The extracted horizon line is correlated with digital elevation model data in the Fourier domain using a minimum output sum of squared error correlation filter. Finally, we determine the location of the maximum correlation score over the search area to estimate the position of the USV. Comprehensive experiments are performed in field trials conducted over 3 days in the archipelago. Our approach provides excellent results by achieving robust position estimates with global positioning system (GPS)-level accuracy in previously unvisited test areas.
@article{diva2:1384261,
author = {Grelsson, Bertil and Robinson, Andreas and Felsberg, Michael and Khan, Fahad},
title = {{GPS-level accurate camera localization with HorizonNet}},
journal = {Journal of Field Robotics},
year = {2020},
volume = {37},
number = {6},
pages = {951--971},
}
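The localization step above correlates the extracted horizon with DEM-rendered horizons in the Fourier domain. As a hedged and much simplified stand-in, circular cross-correlation of two 1-D horizon profiles via the FFT looks as follows; the actual method uses a MOSSE-type correlation filter over a 2-D search area, which is not reproduced here.

import numpy as np

def circular_cross_correlation(signal, template):
    # Correlation scores of the signal against all circular shifts of the template.
    S = np.fft.fft(signal)
    T = np.fft.fft(template)
    return np.real(np.fft.ifft(S * np.conj(T)))

rng = np.random.default_rng(1)
template = rng.standard_normal(360)        # e.g. horizon elevation per degree of azimuth
observed = np.roll(template, 42) + 0.05 * rng.standard_normal(360)
scores = circular_cross_correlation(observed, template)
print(int(np.argmax(scores)))              # recovers the introduced shift of 42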
Generally, convolutional neural networks (CNNs) process data on a regular grid, e.g. data generated by ordinary cameras. Designing CNNs for sparse and irregularly spaced input data is still an open research problem with numerous applications in autonomous driving, robotics, and surveillance. In this paper, we propose an algebraically-constrained normalized convolution layer for CNNs with highly sparse input that has a smaller number of network parameters compared to related work. We propose novel strategies for determining the confidence from the convolution operation and propagating it to consecutive layers. We also propose an objective function that simultaneously minimizes the data error while maximizing the output confidence. To integrate structural information, we also investigate fusion strategies to combine depth and RGB information in our normalized convolution network framework. In addition, we introduce the use of output confidence as auxiliary information to improve the results. The capabilities of our normalized convolution network framework are demonstrated for the problem of scene depth completion. Comprehensive experiments are performed on the KITTI-Depth and the NYU-Depth-v2 datasets. The results clearly demonstrate that the proposed approach achieves superior performance while requiring only about 1-5% of the number of parameters compared to the state-of-the-art methods.
@article{diva2:1362784,
author = {Eldesokey, Abdelrahman and Felsberg, Michael and Khan, Fahad Shahbaz},
title = {{Confidence Propagation through CNNs for Guided Sparse Depth Regression}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2020},
volume = {42},
number = {10},
}
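The core operation in the abstract above is normalized convolution over sparse input with confidences. A minimal, non-learned numpy/scipy sketch of one such layer is given below, using a simple box filter as the applicability kernel; the paper learns the kernels, constrains them algebraically and propagates confidence differently.

import numpy as np
from scipy.signal import convolve2d

def normalized_convolution(data, conf, kernel, eps=1e-8):
    # data: sparse measurements (e.g. projected lidar depth), zeros where missing.
    # conf: confidence map in [0, 1], typically 1 where a measurement exists.
    num = convolve2d(data * conf, kernel, mode="same")
    den = convolve2d(conf, kernel, mode="same")
    out = num / (den + eps)
    out_conf = den / kernel.sum()          # crude propagated confidence
    return out, out_conf

rng = np.random.default_rng(0)
dense = rng.random((32, 32))
conf = (rng.random((32, 32)) < 0.05).astype(float)   # roughly 5% of pixels observed
sparse = dense * conf
filled, new_conf = normalized_convolution(sparse, conf, np.ones((5, 5)))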
Mobile manipulation robots have great potential for roles in support of rescuers on disaster-response missions. Robots can operate in places too dangerous for humans and therefore can assist in accomplishing hazardous tasks while their human operators work at a safe distance. We developed a disaster-response system that consists of the highly flexible Centauro robot and suitable control interfaces, including an immersive telepresence suit and support-operator controls offering different levels of autonomy.
@article{diva2:1382353,
author = {Klamt, Tobias and Rodriguez, Diego and Baccelliere, Lorenzo and Chen, Xi and Chiaradia, Domenico and Cichon, Torben and Gabardi, Massimiliano and Guria, Paolo and Holmquist, Karl and Kamedula, Malgorzata and Karaoguz, Hakan and Kashiri, Navvab and Laurenzi, Arturo and Lenz, Christian and Leonardis, Daniele and Hoffman, Enrico Mingo and Muratore, Luca and Pavlichenko, Dmytro and Porcini, Francesco and Ren, Zeyu and Schilling, Fabian and Schwarz, Max and Solazzi, Massimiliano and Felsberg, Michael and Frisoli, Antonio and Gustmann, Michael and Jensfelt, Patric and Nordberg, Klas and Rossmann, Juergen and Suess, Uwe and Tsagarakis, Nikos G. and Behnke, Sven},
title = {{Flexible Disaster Response of Tomorrow: Final Presentation and Evaluation of the CENTAURO System}},
journal = {IEEE robotics \& automation magazine},
year = {2019},
volume = {26},
number = {4},
pages = {59--72},
}
Quantitative dual-energy computed tomography may improve the accuracy of treatment planning in radiation therapy. Of special interest are algorithms that can estimate the material composition of the imaged object. One example of such an algorithm is the 2D model-based iterative reconstruction algorithm DIRA. The aim of this work is to extend this algorithm to 3D so that it can be used with cone beams and helical scanning. In the new algorithm, the parallel FBP method was replaced with the approximate 3D FBP-based PI-method. Its performance was tested using a mathematical phantom consisting of six ellipsoids. The algorithm substantially reduced the beam-hardening artefact and the artefacts caused by approximate reconstruction after six iterations. Compared to Alvarez-Macovski basis material decomposition, DIRA-3D does not require geometrically consistent projections and hence can be used in dual-source CT scanners. Also, it can use several tissue-specific material bases at the same time to represent the imaged object.
@article{diva2:1367890,
author = {Magnusson, Maria and Björnfot, Magnus and Carlsson Tedgren, Åsa and Alm Carlsson, Gudrun and Sandborg, Michael and Malusek, Alexandr},
title = {{DIRA-3D-a model-based iterative algorithm for accurate dual-energy dual-source 3D helical CT}},
journal = {Biomedical Engineering \& Physics Express},
year = {2019},
volume = {5},
number = {6},
}
This paper revisits the problem of continuous-time structure from motion, and introduces a number of extensions that improve convergence and efficiency. The formulation with a C2-continuous spline for the trajectory naturally incorporates inertial measurements, as derivatives of the sought trajectory. We analyze the behavior of split spline interpolation on SO(3) and on R3, and a joint spline on SE(3), and show that the latter implicitly couples the direction of translation and rotation. Such an assumption can make good sense for a camera mounted on a robot arm, but not for hand-held or body-mounted cameras. Our experiments in the Spline Fusion framework show that a split spline on R3 and SO(3) is preferable over an SE(3) spline in all tested cases. Finally, we investigate the problem of landmark reprojection on rolling shutter cameras, and show that the tested reprojection methods give similar quality, whereas their computational load varies by a factor of two.
@article{diva2:1333843,
author = {Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik},
title = {{Trajectory representation and landmark projection for continuous-time structure from motion}},
journal = {The international journal of robotics research},
year = {2019},
volume = {38},
number = {6},
pages = {686--701},
}
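To make the split-versus-joint distinction concrete, below is a hedged first-order analogue of split interpolation in Python: translation interpolated linearly on R3 and rotation interpolated separately on SO(3) with SLERP. The paper uses C2-continuous splines, which this sketch does not reproduce; the poses and times are invented.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

times = np.array([0.0, 1.0])
translations = np.array([[0.0, 0.0, 0.0],
                         [1.0, 2.0, 0.5]])
rotations = Rotation.from_euler("xyz", [[0, 0, 0], [0, 0, 90]], degrees=True)
slerp = Slerp(times, rotations)            # interpolation on SO(3)

def split_pose(t):
    # Translation: linear interpolation on R^3, independent of the rotation.
    alpha = (t - times[0]) / (times[-1] - times[0])
    p = (1 - alpha) * translations[0] + alpha * translations[-1]
    # Rotation: spherical interpolation on SO(3).
    R = slerp([t])[0]
    return p, R

p, R = split_pose(0.5)
print(p, R.as_euler("xyz", degrees=True))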
The usage of both off-the-shelf and end-to-end trained deep networks has significantly improved the performance of visual tracking on RGB videos. However, the lack of large labeled datasets hampers the usage of convolutional neural networks for tracking in thermal infrared (TIR) images. Therefore, most state-of-the-art methods for tracking on TIR data are still based on handcrafted features. To address this problem, we propose to use image-to-image translation models. These models allow us to translate the abundantly available labeled RGB data to synthetic TIR data. We explore both the usage of paired and unpaired image translation models for this purpose. These methods provide us with a large labeled dataset of synthetic TIR sequences, on which we can train end-to-end optimal features for tracking. To the best of our knowledge, we are the first to train end-to-end features for TIR tracking. We perform extensive experiments on the VOT-TIR2017 dataset. We show that a network trained on a large dataset of synthetic TIR data obtains better performance than one trained on the available real TIR data. Combining both data sources leads to further improvement. In addition, when we combine the network with motion features, we outperform the state of the art with a relative gain of over 10%, clearly showing the efficiency of using synthetic data to train end-to-end TIR trackers.
@article{diva2:1274664,
author = {Zhang, Lichao and Gonzalez-Garcia, Abel and van de Weijer, Joost and Danelljan, Martin and Khan, Fahad},
title = {{Synthetic Data Generation for End-to-End Thermal Infrared Tracking}},
journal = {IEEE Transactions on Image Processing},
year = {2019},
volume = {28},
number = {4},
pages = {1837--1850},
}
Current best local descriptors are learned on a large data set of matching and non-matching keypoint pairs. However, data of this kind are not always available, since the detailed keypoint correspondences can be hard to establish. On the other hand, we can often obtain labels for pairs of keypoint bags. For example, keypoint bags extracted from two images of the same object under different views form a matching pair, and keypoint bags extracted from images of different objects form a non-matching pair. On average, matching pairs should contain more corresponding keypoints than non-matching pairs. We describe an end-to-end differentiable architecture that enables the learning of local keypoint descriptors from such weakly labeled data. In addition, we discuss how to improve the method by incorporating the procedure of mining hard negatives. We also show how our approach can be used to learn convolutional features from unlabeled video signals and 3D models.
@article{diva2:1256386,
author = {Markus, Nenad and Pandzic, Igor S. and Ahlberg, Jörgen},
title = {{Learning Local Descriptors by Optimizing the Keypoint-Correspondence Criterion: Applications to Face Matching, Learning From Unlabeled Videos and 3D-Shape Retrieval}},
journal = {IEEE Transactions on Image Processing},
year = {2019},
volume = {28},
number = {1},
pages = {279--290},
}
Generic visual tracking is a challenging computer vision problem with numerous applications. Most existing approaches rely on appearance information by employing either hand-crafted features or deep RGB features extracted from convolutional neural networks. Despite their success, these approaches struggle in case of ambiguous appearance information, leading to tracking failure. In such cases, we argue that the motion cue provides discriminative and complementary information that can improve tracking performance. Contrary to visual tracking, deep motion features have been successfully applied for action recognition and video classification tasks. Typically, the motion features are learned by training a CNN on optical flow images extracted from large amounts of labeled videos. In this paper, we investigate the impact of deep motion features in a tracking-by-detection framework. We also evaluate the fusion of hand-crafted, deep RGB, and deep motion features and show that they contain complementary information. To the best of our knowledge, we are the first to propose fusing appearance information with deep motion features for visual tracking. Comprehensive experiments clearly demonstrate that our fusion approach with deep motion features outperforms standard methods relying on appearance information alone.
@article{diva2:1209805,
author = {Danelljan, Martin and Bhat, Goutam and Gladh, Susanna and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Deep motion and appearance cues for visual tracking}},
journal = {Pattern Recognition Letters},
year = {2019},
volume = {124},
pages = {74--81},
}
We combine the near-sensor image processing concept with address-event representation, leading to an intensity-ranking image sensor (IRIS), and show the benefits of using this type of sensor for image classification. The functionality of IRIS is to output pixel coordinates (X and Y values) continuously as each pixel has collected a certain number of photons. Thus, the pixel outputs will be automatically intensity ranked. By keeping track of the timing of these events, it is possible to record the full dynamic range of the image. However, in many cases this is not necessary; the intensity ranking in itself gives the information needed for the task at hand. This paper describes techniques for classification and proposes a particular variant (groves) that fits the IRIS architecture well, as it can work on the intensity rankings only. Simulation results using the CIFAR-10 dataset compare the results of the proposed method with the more conventional ferns technique. It is concluded that the simultaneous sensing and classification obtainable with the IRIS sensor yields both fast (shorter than full exposure time) and processing-efficient classification.
@article{diva2:1254020,
author = {Ahlberg, Jörgen and Åström, Anders and Forchheimer, Robert},
title = {{Simultaneous sensing, readout, and classification on an intensity-ranking image sensor}},
journal = {International journal of circuit theory and applications},
year = {2018},
volume = {46},
number = {9},
pages = {1606--1619},
}
Designing discriminative powerful texture features robust to realistic imaging conditions is a challenging computer vision problem with many applications, including material recognition and analysis of satellite or aerial imagery. In the past, most texture description approaches were based on dense orderless statistical distribution of local features. However, most recent approaches to texture recognition and remote sensing scene classification are based on Convolutional Neural Networks (CNNs). The de facto practice when learning these CNN models is to use RGB patches as input with training performed on large amounts of labeled data (ImageNet). In this paper, we show that Local Binary Patterns (LBP) encoded CNN models, codenamed TEX-Nets, trained using mapped coded images with explicit LBP based texture information provide complementary information to the standard RGB deep models. Additionally, two deep architectures, namely early and late fusion, are investigated to combine the texture and color information. To the best of our knowledge, we are the first to investigate Binary Patterns encoded CNNs and different deep network fusion architectures for texture recognition and remote sensing scene classification. We perform comprehensive experiments on four texture recognition datasets and four remote sensing scene classification benchmarks: UC-Merced with 21 scene categories, WHU-RS19 with 19 scene classes, RSSCN7 with 7 categories and the recently introduced large scale aerial image dataset (AID) with 30 aerial scene types. We demonstrate that TEX-Nets provide complementary information to standard RGB deep model of the same network architecture. Our late fusion TEX-Net architecture always improves the overall performance compared to the standard RGB network on both recognition problems. Furthermore, our final combination leads to consistent improvement over the state-of-the-art for remote sensing scene classification.
@article{diva2:1209495,
author = {Anwer, Rao Muhammad and Khan, Fahad and van de Weijer, Joost and Molinier, Matthieu and Laaksonen, Jorma},
title = {{Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification}},
journal = {ISPRS journal of photogrammetry and remote sensing (Print)},
year = {2018},
volume = {138},
pages = {74--85},
}
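The mapped coded images above are built from local binary patterns. As a hedged refresher, a basic (non-uniform, non-rotation-invariant) 8-neighbour LBP code per pixel can be computed as below; the paper's mapping to coded images and the TEX-Net fusion are not reproduced here.

import numpy as np

def lbp_8neighbour(img):
    # Basic 3x3 local binary pattern codes for a grayscale image.
    img = img.astype(float)
    center = img[1:-1, 1:-1]
    # Offsets of the 8 neighbours, enumerated clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy: img.shape[0] - 1 + dy,
                        1 + dx: img.shape[1] - 1 + dx]
        # Each neighbour contributes one bit of the 8-bit code.
        codes += (neighbour >= center).astype(np.int32) << bit
    return codes

rng = np.random.default_rng(0)
print(lbp_8neighbour(rng.integers(0, 256, (8, 8))).shape)   # (6, 6) code map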
Color description is one of the fundamental problems of image understanding. One of the popular ways to represent colors is by means of color names. Most existing work on color names focuses on only the eleven basic color terms of the English language. This could be limiting the discriminative power of these representations, and representations based on more color names are expected to perform better. However, there exists no clear strategy to choose additional color names. We collect a dataset of 28 additional color names. To ensure that the resulting color representation has high discriminative power we propose a method to order the additional color names according to their complementary nature with the basic color names. This allows us to compute color name representations with high discriminative power of arbitrary length. In the experiments we show that these new color name descriptors outperform the existing color name descriptor on the task of visual tracking, person re-identification and image classification.
@article{diva2:1188346,
author = {Yu, Lu and Zhang, Lichao and van de Weijer, Joost and Khan, Fahad and Cheng, Yongmei and Alejandro Parraga, C.},
title = {{Beyond Eleven Color Names for Image Understanding}},
journal = {Machine Vision and Applications},
year = {2018},
volume = {29},
number = {2},
pages = {361--373},
}
Most approaches to human attribute and action recognition in still images are based on an image representation in which multi-scale local features are pooled across scale into a single, scale-invariant encoding. Both in bag-of-words and in the recently popular representations based on convolutional neural networks, local features are computed at multiple scales. However, these multi-scale convolutional features are pooled into a single scale-invariant representation. We argue that entirely scale-invariant image representations are sub-optimal and investigate approaches to scale coding within a bag of deep features framework. Our approach encodes multi-scale information explicitly during the image encoding stage. We propose two strategies to encode multi-scale information explicitly in the final image representation. We validate our two scale coding techniques on five datasets: Willow, PASCAL VOC 2010, PASCAL VOC 2012, Stanford-40 and Human Attributes (HAT-27). On all datasets, the proposed scale coding approaches outperform both the scale-invariant method and the standard deep features of the same network. Further, combining our scale coding approaches with standard deep features leads to consistent improvement over the state of the art.
@article{diva2:1176581,
author = {Khan, Fahad and van de Weijer, Joost and Muhammad Anwer, Rao and Bagdanov, Andrew D. and Felsberg, Michael and Laaksonen, Jorma},
title = {{Scale coding bag of deep features for human attribute and action recognition}},
journal = {Machine Vision and Applications},
year = {2018},
volume = {29},
number = {1},
pages = {55--71},
}
Most methods that address computer vision problems require powerful visual features. Many successful approaches apply techniques motivated from nonparametric statistics. The channel representation provides a framework for nonparametric distribution representation. Although early work has focused on a signal processing view of the representation, the channel representation can be interpreted in probabilistic terms, e.g., representing the distribution of local image orientation. In this paper, a variety of approximative channel-based algorithms for probabilistic problems are presented: a novel efficient algorithm for density reconstruction, a novel and efficient scheme for nonlinear gridding of densities, and finally a novel method for estimating Copula densities. The experimental results provide evidence that by relaxing the requirements for exact solutions, efficient algorithms are obtained.
@article{diva2:1159593,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Approximative Coding Methods for Channel Representations}},
journal = {Journal of Mathematical Imaging and Vision},
year = {2018},
volume = {60},
number = {6},
pages = {929--940},
}
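As a hedged illustration of what a channel representation of a scalar looks like, the sketch below encodes a value with a cos^2 kernel, which is one common choice in this line of work; channel spacing, width and range are arbitrary here, and the decoding schemes that are the actual subject of the paper are not shown.

import numpy as np

def channel_encode(x, centers, width=1.5):
    # Encode scalar x into soft channel activations with a cos^2 kernel.
    d = np.abs(x - centers)
    enc = np.zeros_like(centers, dtype=float)
    active = d < width
    enc[active] = np.cos(np.pi * d[active] / (2 * width)) ** 2
    return enc

centers = np.arange(0.0, 11.0, 1.0)        # channel centers at 0, 1, ..., 10
print(np.round(channel_encode(3.3, centers), 3))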
Visual odometry using only a monocular camera faces more algorithmic challenges than stereo odometry. We present a robust monocular visual odometry framework for automotive applications. An extended propagation-based tracking framework is proposed which yields highly accurate (unscaled) pose estimates. Scale is supplied by ground plane pose estimation employing street pixel labeling using a convolutional neural network (CNN). The proposed framework has been extensively tested on the KITTI dataset and achieves a higher rank than current published state-of-the-art monocular methods in the KITTI odometry benchmark. Unlike other VO/SLAM methods, this result is achieved without loop closing mechanism, without RANSAC and also without multiframe bundle adjustment. Thus, we challenge the common belief that robust systems can only be built using iterative robustification tools like RANSAC.
@article{diva2:1176566,
author = {Fanani, Nolang and Stuerck, Alina and Ochs, Matthias and Bradler, Henry and Mester, Rudolf},
title = {{Predictive monocular odometry (PMO): What is possible without RANSAC and multiframe bundle adjustment?}},
journal = {Image and Vision Computing},
year = {2017},
volume = {68},
}
Purpose: To develop and evaluate, in a proof-of-concept configuration, a novel iterative reconstruction algorithm (DIRA) for quantitative determination of elemental composition of patient tissues for application to brachytherapy with low-energy (< 50 keV) photons and proton therapy. Methods: DIRA was designed as a model-based iterative reconstruction algorithm, which uses filtered backprojection, automatic segmentation and multimaterial tissue decomposition. The evaluation was done for a phantom derived from the voxelized ICRP 110 male phantom. Soft tissues were decomposed into the lipid, protein and water triplet; bones were decomposed into the compact bone and bone marrow doublet. Projections were derived using the Drasim simulation code for an axial scanning configuration resembling a typical DECT (dual-energy CT) scanner with 80 kV and Sn140 kV x-ray spectra. The iterative loop produced mono-energetic images at 50 and 88 keV without beam hardening artifacts. Different noise levels were considered: no noise, a typical noise level in diagnostic imaging, and a reduced noise level corresponding to tenfold higher doses. An uncertainty analysis of the results was performed using type A and B evaluations, and the two approaches were compared. Results: Linear attenuation coefficients averaged over a region were obtained with relative errors less than 0.5% for all evaluated regions. Errors in average mass fractions of the three-material decomposition were less than 0.04 for the no-noise and reduced-noise levels and less than 0.11 for the typical noise level. Mass fractions of individual pixels were strongly affected by noise, which slightly increased after the first iteration but subsequently stabilized. Estimates of uncertainties in mass fractions provided by the type B evaluation differed from the type A estimates by less than 1.5% for most cases. The algorithm was fast; the results converged after five iterations. The algorithmic complexity of forward polyenergetic projection calculation was much reduced by using material doublets and triplets. Conclusions: The simulations indicated that DIRA is capable of determining the elemental composition of tissues, which is needed in brachytherapy with low-energy (< 50 keV) photons and proton therapy. The algorithm provided quantitative monoenergetic images with beam hardening artifacts removed. Its convergence was fast, image sharpness expressed via the modulation transfer function was maintained, and image noise did not increase with the number of iterations.
@article{diva2:1140801,
author = {Malusek, Alexandr and Magnusson, Maria and Sandborg, Michael and Alm Carlsson, Gudrun},
title = {{A model-based iterative reconstruction algorithm DIRA using patient-specific tissue classification via DECT for improved quantitative CT in dose planning}},
journal = {Medical physics (Lancaster)},
year = {2017},
volume = {44},
number = {6},
pages = {2345--2357},
}
Privacy protection may be defined as replacing the original content in an image region with new (less intrusive) content that modifies the target's appearance so as to make it less recognizable. The development of privacy protection techniques also needs to be complemented with an established objective evaluation method to facilitate their assessment and comparison. Generally, existing evaluation methods rely on subjective judgements or assume a specific target type in image data and use target detection and recognition accuracies to assess privacy protection. This work proposes a new annotation-free evaluation method that is neither subjective nor assumes a specific target type. It assesses two key aspects of privacy protection: protection and utility. Protection is quantified as an appearance similarity and utility is measured as a structural similarity between the original and privacy-protected image regions. We performed extensive experimentation using six challenging datasets (comprising 12 video sequences), including a new dataset (with six sequences) that contains visible and thermal imagery. The new dataset, called TST-Priv, is made available online to the research community. We demonstrate the effectiveness of the proposed method by evaluating six image-based privacy protection techniques, and also show comparisons of the proposed method with existing methods.
@article{diva2:1138417,
author = {Nawaz, Tahir and Berg, Amanda and Ferryman, James and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Effective evaluation of privacy protection techniques in visible and thermal imagery}},
journal = {Journal of Electronic Imaging (JEI)},
year = {2017},
volume = {26},
number = {5},
}
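Utility in the abstract above is measured as a structural similarity between the original and the protected region. A hedged sketch using the standard SSIM implementation from scikit-image is shown below; the paper's exact protection and utility scores may be computed differently, and the images here are synthetic stand-ins.

import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.random((64, 64))
# Stand-in "privacy-protected" region: the original mixed with noise.
protected = 0.5 * original + 0.5 * rng.random((64, 64))

utility = structural_similarity(original, protected, data_range=1.0)
print(round(utility, 3))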
Mammalian herbivores have important top-down effects on ecological processes and landscapes by generating vegetation changes through grazing and trampling. For free-ranging herbivores on large landscapes, trampling is an important ecological factor. However, whereas grazing is widely studied, low-intensity trampling is rarely studied and quantified. The cold-adapted northern tundra reindeer (Rangifer tarandus) is a wide-ranging keystone herbivore in large open alpine and Arctic ecosystems. Reindeer may largely subsist on different species of slow-growing ground lichens, particularly in winter. Lichens grow in dry, snow-poor habitats with frost, and their varying elasticity makes them suitable for studying trampling. In replicated factorial experiments, high-resolution 3D laser scanning was used to quantify lichen volume loss from trampling by a reindeer hoof. Losses were substantial, that is, about 0.3 dm3 per imprint in dry thick lichen, but depended on the type of lichen mat and humidity. Immediate trampling volume loss was about twice as high in dry compared to humid thin (2–3 cm) lichen mats, and about three times as high in dry vs. humid thick (6–8 cm) lichen mats. There was no significant difference in volume loss between 100% and 50% wetted lichen. Regained volume with time was insignificant for dry lichen, whereas 50% humid lichen regained substantial volumes, and 100% humid lichen regained almost all lost volume, mostly within 10–20 min. Reindeer trampling may thus have anywhere from negligible to devastating effects on exposed lichen forage. During a normal week of foraging, moving 5 km daily across dry 6- to 8-cm-thick continuous lichen mats, one adult reindeer may trample a lichen volume corresponding to about a year's supply of lichen. However, lichen humidity appears to be an important factor for trampling loss, in addition to the extent of reindeer movement.
@article{diva2:1136863,
author = {Heggenes, Jan and Odland, Arvid and Chevalier, Tomas and Ahlberg, Jörgen and Berg, Amanda and Larsson, Håkan and Bjerketvedt, Dag},
title = {{Herbivore grazing--or trampling? Trampling effects by a large ungulate in cold high-latitude ecosystems}},
journal = {Ecology and Evolution},
year = {2017},
volume = {7},
number = {16},
pages = {6423--6431},
}
Accurate scale estimation of a target is a challenging research problem in visual object tracking. Most state-of-the-art methods employ an exhaustive scale search to estimate the target size. The exhaustive search strategy is computationally expensive and struggles when encountering large scale variations. This paper investigates the problem of accurate and robust scale estimation in a tracking-by-detection framework. We propose a novel scale adaptive tracking approach by learning separate discriminative correlation filters for translation and scale estimation. The explicit scale filter is learned online using the target appearance sampled at a set of different scales. Contrary to standard approaches, our method directly learns the appearance change induced by variations in the target scale. Additionally, we investigate strategies to reduce the computational cost of our approach. Extensive experiments are performed on the OTB and the VOT2014 datasets. Compared to the standard exhaustive scale search, our approach achieves a gain of 2.5 percent in average overlap precision on the OTB dataset. Additionally, our method is computationally efficient, operating at a 50 percent higher frame rate compared to the exhaustive scale search. Our method obtains the top rank in performance by outperforming 19 state-of-the-art trackers on OTB and 37 state-of-the-art trackers on VOT2014.
@article{diva2:1129861,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Discriminative Scale Space Tracking}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2017},
volume = {39},
number = {8},
pages = {1561--1575},
}
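Below is a hedged, single-channel sketch of learning a discriminative correlation filter in closed form in the Fourier domain (a MOSSE-style filter with a Gaussian desired output). The actual tracker learns separate multi-channel translation and scale filters with online updates, which this sketch omits.

import numpy as np

def gaussian_response(shape, sigma=2.0):
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    g = np.exp(-((y - h // 2) ** 2 + (x - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))   # peak at (0, 0)

def learn_filter(patches, lam=1e-2):
    # Closed-form correlation filter from a list of training patches.
    G = np.fft.fft2(gaussian_response(patches[0].shape))
    num = np.zeros_like(G)
    den = np.zeros_like(G)
    for p in patches:
        F = np.fft.fft2(p)
        num += G * np.conj(F)
        den += F * np.conj(F)
    return num / (den + lam)

def detect(H, patch):
    # Correlation response of the learned filter on a new patch.
    return np.real(np.fft.ifft2(H * np.fft.fft2(patch)))

rng = np.random.default_rng(0)
target = rng.standard_normal((32, 32))
H = learn_filter([target + 0.05 * rng.standard_normal((32, 32)) for _ in range(5)])
resp = detect(H, np.roll(target, (3, 4), axis=(0, 1)))
print(np.unravel_index(np.argmax(resp), resp.shape))         # close to (3, 4)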
In this work, we introduce a novel tensor-based functional for targeted image enhancement and denoising. Via explicit regularization, our formulation incorporates application-dependent and contextual information using first principles. Few works in the literature treat variational models that describe both application-dependent information and contextual knowledge of the denoising problem. We prove the existence of a minimizer and present results on tensor symmetry constraints, convexity, and geometric interpretation of the proposed functional. We show that our framework excels in applications where nonlinear functions are present, such as in gamma correction and targeted value range filtering. We also study general denoising performance, where we show comparable results to dedicated PDE-based state-of-the-art methods.
@article{diva2:1089909,
author = {Åström, Freddie and Felsberg, Michael and Baravdish, George},
title = {{Mapping-Based Image Diffusion}},
journal = {Journal of Mathematical Imaging and Vision},
year = {2017},
volume = {57},
number = {3},
pages = {293--323},
}
We address two problems related to large-scale aerial monitoring of district heating networks. First, we propose a classification scheme to reduce the number of false alarms among automatically detected leakages in district heating networks. The leakages are detected in images captured by an airborne thermal camera, and each detection corresponds to an image region with abnormally high temperature. This approach yields a significant number of false positives, and we propose to reduce this number in two steps: (a) by using a building segmentation scheme to remove detections on buildings, and (b) by using a machine learning approach to classify the remaining detections as true or false leakages. We provide extensive experimental analysis on real-world data, showing that this post-processing step significantly improves the usefulness of the system. Second, we propose a method for characterization of leakages over time, i.e., repeating the image acquisition one or a few years later and indicating areas that suffer from an increased energy loss. We address the problem of finding trends in the degradation of pipe networks in order to plan for long-term maintenance, and propose a visualization scheme exploiting the consecutive data collections.
@article{diva2:1054676,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Enhanced analysis of thermographic images for monitoring of district heat pipe networks}},
journal = {Pattern Recognition Letters},
year = {2016},
volume = {83},
number = {2},
pages = {215--223},
}
1. Migratory songbirds carry an inherited capacity to migrate several thousand kilometers each year, crossing continental landmasses and barriers between distant breeding sites and wintering areas. How individual songbirds manage to find their way with extreme precision is still largely unknown. The functional characteristics of biological compasses used by songbird migrants have mainly been investigated by recording the birds' directed migratory activity in circular cages, so-called Emlen funnels. This method is 50 years old and has not received major updates over the past decades. The aim of this work was to compare the results from newly developed digital methods with the established manual methods to evaluate songbird migratory activity and orientation in circular cages. 2. We performed orientation experiments with the European robin (Erithacus rubecula) using modified Emlen funnels equipped with thermal paper, and simultaneously recorded the songbird movements from above. We evaluated and compared the results obtained with five different methods. Two methods have been commonly used in songbird orientation experiments; the other three methods were developed for this study and were based either on evaluation of the thermal paper using automated image analysis, or on the analysis of videos recorded during the experiment. 3. The methods used to evaluate scratches produced by the claws of birds on the thermal papers presented some differences compared with the video analyses. These differences were caused mainly by differences in scatter, as any movement of the bird along the sloping walls of the funnel was recorded on the thermal paper, whereas video evaluations allowed us to detect single takeoff attempts by the birds and to consider only this behavior in the orientation analyses. Using computer vision, we were also able to identify and separately evaluate different behaviors that were impossible to record on the thermal paper. 4. The traditional Emlen funnel is still the most used method to investigate compass orientation in songbirds under controlled conditions. However, new numerical image analysis techniques provide a much higher level of detail of songbirds' migratory behavior and will provide an increasing number of possibilities to evaluate and quantify specific behaviors as new algorithms are developed.
@article{diva2:1046409,
author = {Bianco, Giuseppe and Ilieva, Mihaela and Veibäck, Clas and Öfjäll, Kristoffer and Gadomska, Alicja and Hendeby, Gustaf and Felsberg, Michael and Gustafsson, Fredrik and Åkesson, Susanne},
title = {{Emlen funnel experiments revisited: methods update for studying compass orientation in songbirds}},
journal = {Ecology and Evolution},
year = {2016},
volume = {6},
number = {19},
pages = {6930--6942},
}
New paradigms for parallel programming have been devised to simplify software development on multi-core processors and many-core graphical processing units (GPU). Despite their obvious benefits, the parallelisation of existing computer programs is not an easy task. In this work, the use of the Open Multiprocessing (OpenMP) and Open Computing Language (OpenCL) frameworks is considered for the parallelisation of the model-based iterative reconstruction algorithm DIRA with the aim of significantly shortening the code's execution time. Selected routines were parallelised using OpenMP and OpenCL libraries; some routines were converted from MATLAB to C and optimised. Parallelisation of the code with OpenMP was easy and resulted in an overall speedup of 15 on a 16-core computer. Parallelisation with OpenCL was more difficult owing to differences between the central processing unit and GPU architectures. The resulting speedup was substantially lower than the theoretical peak performance of the GPU; the cause was explained.
@article{diva2:875757,
author = {Örtenberg, Alexander and Magnusson, Maria and Sandborg, Michael and Alm Carlsson, Gudrun and Malusek, Alexandr},
title = {{PARALLELISATION OF THE MODEL-BASED ITERATIVE RECONSTRUCTION ALGORITHM DIRA}},
journal = {Radiation Protection Dosimetry},
year = {2016},
volume = {169},
number = {1-4},
pages = {405--409},
}
Advanced model-based iterative reconstruction algorithms in quantitative computed tomography (CT) perform automatic segmentation of tissues to estimate material properties of the imaged object. Compared with conventional methods, these algorithms may improve quality of reconstructed images and accuracy of radiation treatment planning. Automatic segmentation of tissues is, however, a difficult task. The aim of this work was to develop and evaluate an algorithm that automatically segments tissues in CT images of the male pelvis. The newly developed algorithm (MK2014) combines histogram matching, thresholding, region growing, deformable model and atlas-based registration techniques for the segmentation of bones, adipose tissue, prostate and muscles in CT images. Visual inspection of segmented images showed that the algorithm performed well for the five analysed images. The tissues were identified and outlined with accuracy sufficient for the dual-energy iterative reconstruction algorithm whose aim is to improve the accuracy of radiation treatment planning in brachytherapy of the prostate.
@article{diva2:875370,
author = {Kardell, Martin and Magnusson, Maria and Sandborg, Michael and Alm Carlsson, Gudrun and Jeuthe, Julius and Malusek, Alexandr},
title = {{AUTOMATIC SEGMENTATION OF PELVIS FOR BRACHYTHERAPY OF PROSTATE}},
journal = {Radiation Protection Dosimetry},
year = {2016},
volume = {169},
number = {1-4},
pages = {398--404},
}
Attitude (pitch and roll angle) estimation from visual information is necessary for GPS-free navigation of airborne vehicles. We propose a highly accurate method to estimate the attitude by horizon detection in fisheye images. A Canny edge detector and a probabilistic Hough voting scheme are used to compute an approximate attitude and the corresponding horizon line in the image. Horizon edge pixels are extracted in a band close to the approximate horizon line. The attitude estimates are refined through registration of the extracted edge pixels with the geometrical horizon from a digital elevation map (DEM), in our case the SRTM3 database, extracted at a given approximate position. The proposed method has been evaluated using 1629 images from a flight trial with flight altitudes up to 600 m in an area with ground elevations ranging from sea level up to 500 m. Compared with the ground truth from a filtered inertial measurement unit (IMU)/GPS solution, the standard deviations for the pitch and roll angle errors obtained with 30 Mpixel images are 0.04° and 0.05°, respectively, with mean errors smaller than 0.02°. To achieve the high-accuracy attitude estimates, the ray refraction in the earth's atmosphere has been taken into account. The attitude errors obtained on real images are less than or equal to those achieved on synthetic images for previous methods with DEM refinement, and the errors are about one order of magnitude smaller than for any previous vision-based method without DEM refinement.
@article{diva2:729541,
author = {Grelsson, Bertil and Felsberg, Michael and Isaksson, Folke},
title = {{Highly Accurate Attitude Estimation via Horizon Detection}},
journal = {Journal of Field Robotics},
year = {2016},
volume = {33},
number = {7},
pages = {967--993},
}
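The coarse stage above combines edge detection with probabilistic Hough voting. As a hedged and much simpler stand-in, a straight-line horizon estimate from Canny edges and a standard Hough transform with OpenCV might look as follows; the paper works in fisheye geometry and refines the estimate against a DEM, neither of which is shown.

import numpy as np
import cv2

def coarse_horizon(gray):
    # Return (rho, theta) of the strongest straight line in the image, or None.
    edges = cv2.Canny(gray, 50, 150)
    # Standard Hough transform: 1 px rho resolution, 1 degree theta resolution.
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)
    if lines is None:
        return None
    rho, theta = lines[0][0]               # strongest accumulator peak
    return float(rho), float(theta)

# Synthetic test image: bright sky above a darker ground plane.
img = np.zeros((200, 300), dtype=np.uint8)
img[:90, :] = 200
print(coarse_horizon(img))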
Action recognition in still images is a challenging problem in computer vision. To facilitate comparative evaluation independently of person detection, the standard evaluation protocol for action recognition uses an oracle person detector to obtain perfect bounding box information at both training and test time. The assumption is that, in practice, a general person detector will provide candidate bounding boxes for action recognition. In this paper, we argue that this paradigm is suboptimal and that action class labels should already be considered during the detection stage. Motivated by the observation that body pose is strongly conditioned on action class, we show that: 1) the existing state-of-the-art generic person detectors are not adequate for proposing candidate bounding boxes for action classification; 2) due to limited training examples, the direct training of action-specific person detectors is also inadequate; and 3) using only a small number of labeled action examples, the transfer learning is able to adapt an existing detector to propose higher quality bounding boxes for subsequent action classification. To the best of our knowledge, we are the first to investigate transfer learning for the task of action-specific person detection in still images. We perform extensive experiments on two benchmark data sets: 1) Stanford-40 and 2) PASCAL VOC 2012. For the action detection task (i.e., both person localization and classification of the action performed), our approach outperforms methods based on general person detection by 5.7% mean average precision (MAP) on Stanford-40 and 2.1% MAP on PASCAL VOC 2012. Our approach also significantly outperforms the state of the art with a MAP of 45.4% on Stanford-40 and 31.4% on PASCAL VOC 2012. We also evaluate our action detection approach for the task of action classification (i.e., recognizing actions without localizing them). For this task, our approach, without using any ground-truth person localization at test time, outperforms on both data sets state-of-the-art methods, which do use person locations.
@article{diva2:855148,
author = {Khan, Fahad and Xu, Jiaolong and van de Weijer, Joost and Bagdanov, Andrew D. and Muhammad Anwer, Rao and Lopez, Antonio M.},
title = {{Recognizing Actions Through Action-Specific Person Detection}},
journal = {IEEE Transactions on Image Processing},
year = {2015},
volume = {24},
number = {11},
pages = {4422--4432},
}
Visual feature descriptors are essential elements in most computer and robot vision systems. They typically lead to an abstraction of the input data, images, or video, for further processing, such as clustering and machine learning. In clustering applications, the cluster center represents the prototypical descriptor of the cluster and estimates the corresponding signal value, such as color value or dominating flow orientation, by decoding the prototypical descriptor. Machine learning applications determine the relevance of respective descriptors, and a visualization of the corresponding decoded information is very useful for the analysis of the learning algorithm. Thus decoding of feature descriptors is a relevant problem, frequently addressed in recent work. Also, the human brain represents sensorimotor information at a suitable abstraction level through varying activation of neuron populations. In previous work, computational models have been derived that agree with findings of neurophysiological experiments on the representation of visual information by decoding the underlying signals. However, the represented variables have a bias toward centers or boundaries of the tuning curves. Despite the fact that feature descriptors in computer vision are motivated from neuroscience, the respective decoding methods have been derived largely independently. From first principles, we derive unbiased decoding schemes for biologically motivated feature descriptors with a minimum amount of redundancy and suitable invariance properties. These descriptors establish a non-parametric density estimation of the underlying stochastic process with a particular algebraic structure. Based on the resulting algebraic constraints, we show formally how the decoding problem is formulated as an unbiased maximum likelihood estimator and we derive a recurrent inverse diffusion scheme to infer the dominating mode of the distribution. These methods are evaluated in experiments, where stationary points and bias from noisy image data are compared to existing methods.
@article{diva2:850261,
author = {Felsberg, Michael and Öfjäll, Kristoffer and Lenz, Reiner},
title = {{Unbiased decoding of biologically motivated visual feature descriptors}},
journal = {Frontiers in Robotics and AI},
year = {2015},
volume = {2},
number = {20},
}
An image mosaic is an assembly of a large number of small images, usually called tiles, taken from a specific dictionary/codebook. When viewed as a whole, the appearance of a single large image emerges, i.e. each tile approximates a small block of pixels. ASCII art is a related (and older) graphic design technique for producing images from printable characters. Although automatic procedures for both of these visualization schemes have been studied in the past, some are computationally heavy and cannot offer real-time and interactive performance. We propose an algorithm able to reproduce the quality of existing non-photorealistic rendering techniques, in particular ASCII art and image mosaics, obtaining large performance speed-ups. The basic idea is to partition the input image into a rectangular grid and use a decision tree to assign a tile from a pre-determined codebook to each cell. Our implementation can process video streams from webcams in real time and it is suitable for modestly equipped devices. We evaluate our technique by generating the renderings of a variety of images and videos, with good results. The source code of our engine is publicly available.
@article{diva2:845458,
author = {Marku\v{s}, Nenad and Fratarcangeli, Marco and Pandži\'{c}, Igor and Ahlberg, Jörgen},
title = {{Fast Rendering of Image Mosaics and ASCII Art}},
journal = {Computer graphics forum (Print)},
year = {2015},
volume = {34},
number = {6},
pages = {251--261},
}
In this study, we investigate the backward p(x)-parabolic equation as a new methodology to enhance images. We propose a novel iterative regularization procedure for the backward p(x)-parabolic equation based on the nonlinear Landweber method for inverse problems. The proposed scheme can also be extended to the family of iterative regularization methods involving the nonlinear Landweber method. We also investigate the connection between the variable exponent p(x) in the proposed energy functional and the diffusivity function in the corresponding Euler-Lagrange equation. It is well known that the forward problem converges to a constant solution, destroying the image. The purpose of the backward approach is twofold. First, by solving the backward problem through a sequence of forward problems, we obtain a smooth, denoised image. Second, by choosing the initial data properly, we try to reduce the blurriness of the image. Preliminary numerical results for denoising appear to give an improvement over standard methods.
@article{diva2:758277,
author = {Baravdish, George and Svensson, Olof and Åström, Freddie},
title = {{On Backward \emph{p}(\emph{x})-Parabolic Equations for Image Enhancement}},
journal = {Numerical Functional Analysis and Optimization},
year = {2015},
volume = {36},
number = {2},
pages = {147--168},
}
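For context, the nonlinear Landweber method on which the regularization procedure above is based takes, in its generic inverse-problem form (a hedged textbook statement, not the paper's exact scheme), the iteration

\[
  u_{k+1} = u_k - \omega\, K'(u_k)^{*}\bigl(K(u_k) - f\bigr),
\]

where \(K\) is the (nonlinear) forward operator, \(K'(u_k)^{*}\) the adjoint of its linearization, \(f\) the data, and \(\omega > 0\) a step size; the iteration is stopped early, e.g. by a discrepancy principle, to obtain a regularized solution.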
Describing textures is a challenging problem in computer vision and pattern recognition. The classification problem involves assigning a category label to the texture class it belongs to. Several factors such as variations in scale, illumination and viewpoint make the problem of texture description extremely challenging. A variety of histogram-based texture representations exist in the literature. However, combining multiple texture descriptors and assessing their complementarity is still an open research problem. In this paper, we first show that combining multiple local texture descriptors significantly improves the recognition performance compared to using a single best method alone. This gain in performance is achieved at the cost of a high-dimensional final image representation. To counter this problem, we propose to use an information-theoretic compression technique to obtain a compact texture description without any significant loss in accuracy. In addition, we perform a comprehensive evaluation of pure color descriptors, popular in object recognition, for the problem of texture classification. Experiments are performed on four challenging texture datasets namely, KTH-TIPS-2a, KTH-TIPS-2b, FMD and Texture-10. The experiments clearly demonstrate that our proposed compact multi-texture approach outperforms the single best texture method alone. In all cases, discriminative color names outperform other color features for texture classification. Finally, we show that combining discriminative color names with compact texture representation outperforms state-of-the-art methods by 7.8%, 4.3% and 5.0% on the KTH-TIPS-2a, KTH-TIPS-2b and Texture-10 datasets, respectively.
@article{diva2:756961,
author = {Khan, Fahad Shahbaz and Muhammad Anwer, Rao and van de Weijer, Joost and Felsberg, Michael and Laaksonen, Jorma},
title = {{Compact color--texture description for texture classification}},
journal = {Pattern Recognition Letters},
year = {2015},
volume = {51},
pages = {16--22},
}
Computer analysis of visual art, especially paintings, is an interesting cross-disciplinary research domain. Most research on the analysis of paintings involves small to medium-sized datasets with their own specific settings. Interestingly, significant progress has been made in the field of object and scene recognition lately. A key factor in this success is the introduction and availability of benchmark datasets for evaluation. Surprisingly, such a benchmark setup is still missing in the area of computational painting categorization. In this work, we propose a novel large scale dataset of digital paintings. The dataset consists of paintings from 91 different painters. We further show three applications of our dataset namely: artist categorization, style classification and saliency detection. We investigate how local and global features popular in image classification perform for the tasks of artist and style categorization. For both categorization tasks, our experimental results suggest that combining multiple features significantly improves the final performance. We show that state-of-the-art computer vision methods can correctly attribute 50% of unseen paintings in a large dataset to their painter and correctly identify the artistic style in over 60% of the cases. Additionally, we explore the task of saliency detection on paintings and show experimental findings using state-of-the-art saliency estimation algorithms.
@article{diva2:756963,
author = {Khan, Fahad Shahbaz and Beigpour, Shida and van de Weijer, Joost and Felsberg, Michael},
title = {{Painting-91:
a large scale database for computational painting categorization}},
journal = {Machine Vision and Applications},
year = {2014},
volume = {25},
number = {6},
pages = {1385--1397},
}
Person description is a challenging problem in computer vision. We investigated two major aspects of person description: 1) gender and 2) action recognition in still images. Most state-of-the-art approaches for gender and action recognition rely on the description of a single body part, such as face or full-body. However, relying on a single body part is suboptimal due to significant variations in scale, viewpoint, and pose in real-world images. This paper proposes a semantic pyramid approach for pose normalization. Our approach is fully automatic and based on combining information from full-body, upper-body, and face regions for gender and action recognition in still images. The proposed approach does not require any annotations for upper-body and face of a person. Instead, we rely on pretrained state-of-the-art upper-body and face detectors to automatically extract semantic information of a person. Given multiple bounding boxes from each body part detector, we then propose a simple method to select the best candidate bounding box, which is used for feature extraction. Finally, the extracted features from the full-body, upper-body, and face regions are combined into a single representation for classification. To validate the proposed approach for gender recognition, experiments are performed on three large data sets namely: 1) human attribute; 2) head-shoulder; and 3) proxemics. For action recognition, we perform experiments on four data sets most used for benchmarking action recognition in still images: 1) Sports; 2) Willow; 3) PASCAL VOC 2010; and 4) Stanford-40. Our experiments clearly demonstrate that the proposed approach, despite its simplicity, outperforms state-of-the-art methods for gender and action recognition.
@article{diva2:744038,
author = {Khan, Fahad and van de Weijer, Joost and Muhammad Anwer, Rao and Felsberg, Michael and Gatta, Carlo},
title = {{Semantic Pyramids for Gender and Action Recognition}},
journal = {IEEE Transactions on Image Processing},
year = {2014},
volume = {23},
number = {8},
pages = {3633--3645},
}
This article deals with fast and accurate visualization of pushbroom image data from airborne and spaceborne platforms. A pushbroom sensor acquires images in a line-scanning fashion, and this results in scattered input data that needs to be resampled onto a uniform grid for geometrically correct visualization. To this end, we model the anisotropic spatial dependence structure caused by the acquisition process. Several methods for scattered data interpolation are then adapted to handle the induced anisotropic metric and compared for the pushbroom image rectification problem. A trick that exploits the semi-ordered line structure of pushbroom data to improve the computational complexity by several orders of magnitude is also presented.
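A minimal sketch of scattered-data interpolation under an anisotropic metric, here plain inverse-distance weighting with a Mahalanobis-type distance; the metric matrix M and the weighting scheme are illustrative assumptions rather than the adapted methods compared in the paper:

import numpy as np

def anisotropic_idw(points, values, query, M, power=2.0, eps=1e-12):
    """Inverse-distance weighting where distances are measured with the
    anisotropic metric d(p, q)^2 = (p - q)^T M (p - q).

    points : (N, 2) scattered sample positions
    values : (N,)   sample values
    query  : (Q, 2) positions on the target grid
    M      : (2, 2) symmetric positive-definite metric matrix
    """
    diff = query[:, None, :] - points[None, :, :]          # (Q, N, 2)
    d2 = np.einsum('qni,ij,qnj->qn', diff, M, diff)        # squared metric distances
    w = 1.0 / (d2 ** (power / 2.0) + eps)                  # IDW weights
    return (w * values[None, :]).sum(axis=1) / w.sum(axis=1)

# toy usage: an elongated metric penalising one direction more than the other
rng = np.random.default_rng(0)
pts = rng.random((200, 2))
vals = np.sin(6 * pts[:, 0]) + pts[:, 1]
grid = np.stack(np.meshgrid(np.linspace(0, 1, 32),
                            np.linspace(0, 1, 32)), axis=-1).reshape(-1, 2)
M = np.array([[1.0, 0.0], [0.0, 4.0]])
interp = anisotropic_idw(pts, vals, grid, M)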
@article{diva2:729174,
author = {Ringaby, Erik and Forss\'{e}n, Per-Erik and Friman, Ola and Olsvik Opsahl, Thomas and Vegard Haavardsholm, Trym and Kåsen, Ingebjørg},
title = {{Anisotropic Scattered Data Interpolation for Pushbroom Image Rectification}},
journal = {IEEE Transactions on Image Processing},
year = {2014},
volume = {23},
number = {5},
pages = {2302--2314},
}
In this article we investigate the problem of human action recognition in static images. By action recognition we mean a class of problems which includes both action classification and action detection (i.e. simultaneous localization and classification). Bag-of-words image representations yield promising results for action classification, and deformable part models perform very well for object detection. The representations for action recognition typically use only shape cues and ignore color information. Inspired by the recent success of color in image classification and object detection, we investigate the potential of color for action classification and detection in static images. We perform a comprehensive evaluation of color descriptors and fusion approaches for action recognition. Experiments were conducted on the three datasets most used for benchmarking action recognition in still images: Willow, PASCAL VOC 2010 and Stanford-40. Our experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors. We further demonstrate that late fusion of color and shape information outperforms other approaches on action recognition. Finally, we show that the different color–shape fusion approaches result in complementary information and combining them yields state-of-the-art performance for action classification.
@article{diva2:647854,
author = {Khan, Fahad Shahbaz and Rao, Muhammad Anwer and van de Weijer, Joost and Bagdanov, Andrew and Lopez, Antonio and Felsberg, Michael},
title = {{Coloring Action Recognition in Still Images}},
journal = {International Journal of Computer Vision},
year = {2013},
volume = {105},
number = {3},
pages = {205--221},
}
In this paper, we introduce a novel framework for low-level image processing and analysis. First, we process images with very simple, difference-based filter functions. Second, we fit the 2-parameter Weibull distribution to the filtered output. This maps each image to the 2D Weibull manifold. Third, we exploit the information geometry of this manifold and solve low-level image processing tasks as minimisation problems on point sets. As a proof-of-concept example, we examine the image autofocusing task. We propose appropriate cost functions together with a simple implicitly-constrained manifold optimisation algorithm and show that our framework compares very favourably against common autofocus methods from the literature. In particular, our approach exhibits the best overall performance in terms of combined speed and accuracy.
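A minimal sketch of the mapping onto the 2D Weibull manifold, assuming gradient magnitude as the difference-based filter and a maximum-likelihood fit with SciPy; the filter choice and the fitting routine are illustrative assumptions:

import numpy as np
from scipy.stats import weibull_min

def weibull_coordinates(image):
    """Map a grayscale image to a point (shape, scale) on the 2-parameter
    Weibull manifold by fitting the distribution of a simple difference-based
    filter response (here the gradient magnitude)."""
    gy, gx = np.gradient(image.astype(float))
    response = np.hypot(gx, gy).ravel()
    response = response[response > 0]              # Weibull support is (0, inf)
    shape, _, scale = weibull_min.fit(response, floc=0.0)
    return shape, scale

# toy usage: a sharp and a blurred image map to different points on the manifold
rng = np.random.default_rng(1)
sharp = rng.random((64, 64))
blurred = 0.25 * (sharp + np.roll(sharp, 1, 0) + np.roll(sharp, 1, 1)
                  + np.roll(np.roll(sharp, 1, 0), 1, 1))
print(weibull_coordinates(sharp), weibull_coordinates(blurred))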
@article{diva2:614862,
author = {Zografos, Vasileios and Lenz, Reiner and Felsberg, Michael},
title = {{The Weibull manifold in low-level image processing: an application to automatic image focusing.}},
journal = {Image and Vision Computing},
year = {2013},
volume = {31},
number = {5},
pages = {401--417},
}
Dosimetric accuracy of radiation treatment planning in brachytherapy depends on knowledge of tissue composition. It has been speculated that soft tissues can be decomposed to water, lipid and protein. The aim of our work is to evaluate the accuracy of such tissue decomposition. Selected abdominal soft tissues, whose average elemental compositions were taken from literature, were decomposed using dual energy computed tomography to water, lipid and protein via the three-material decomposition method. The quality of the decomposition was assessed using relative differences between (i) mass energy absorption and (ii) mass energy attenuation coefficients of the analyzed and approximated tissues. It was found that the relative differences were less than 2% for photon energies larger than 10 keV. The differences were notably smaller than the ones for water as the transport and dose scoring medium. The choice of the water, protein and lipid triplet resulted in negative elemental mass fractions for some analyzed tissues. As negative elemental mass fractions cannot be used in general purpose particle transport computer codes using the Monte Carlo method, other triplets should be used for the decomposition. These triplets may further improve the accuracy of the approximation as the differences were mainly caused by the lack of high-Z materials in the water, protein and lipid triplet.
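A minimal sketch of the three-material decomposition step: the mass fractions of water, lipid and protein follow from a small linear system built from two dual-energy measurements and the constraint that the fractions sum to one. The coefficient values below are arbitrary placeholders, not tabulated attenuation data:

import numpy as np

def three_material_decomposition(measured, coeffs):
    """Solve for the mass fractions (w_water, w_lipid, w_protein).

    measured : length-2 vector of dual-energy measurements for the tissue
               (e.g. mass attenuation coefficients at the two effective energies)
    coeffs   : (2, 3) matrix with the corresponding values for the three basis
               materials, one column per material
    """
    A = np.vstack([coeffs, np.ones(3)])   # two measurement rows plus the sum-to-one row
    b = np.append(measured, 1.0)
    fractions = np.linalg.solve(A, b)
    return fractions                      # may contain negative entries, as noted above

# toy usage with placeholder numbers (illustrative only)
coeffs = np.array([[0.20, 0.18, 0.19],
                   [0.15, 0.12, 0.14]])
print(three_material_decomposition(np.array([0.19, 0.14]), coeffs))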
@article{diva2:609387,
author = {Malusek, Alexandr and Karlsson, Mattias and Magnusson, Maria and Alm Carlsson, Gudrun},
title = {{The potential of dual-energy computed tomography for quantitative decomposition of soft tissues to water, protein and lipid in brachytherapy}},
journal = {Physics in Medicine and Biology},
year = {2013},
volume = {58},
number = {4},
pages = {771--785},
}
Perception-action (P-A) learning is an approach to cognitive system building that seeks to reduce the complexity associated with conventional environment-representation/action-planning approaches. Instead, actions are directly mapped onto the perceptual transitions that they bring about, eliminating the need for intermediate representation and significantly reducing training requirements. We here set out a very general learning framework for cognitive systems in which online learning of the P-A mapping may be conducted within a symbolic processing context, so that complex contextual reasoning can influence the P-A mapping. In utilizing a variational calculus approach to define a suitable objective function, the P-A mapping can be treated as an online learning problem via gradient descent using partial derivatives. Our central theoretical result is to demonstrate top-down modulation of low-level perceptual confidences via the Jacobian of the higher levels of a subsumptive P-A hierarchy. Thus, the separation of the Jacobian as a multiplying factor between levels within the objective function naturally enables the integration of abstract symbolic manipulation in the form of fuzzy deductive logic into the P-A mapping learning. We experimentally demonstrate that the resulting framework achieves significantly better accuracy than using P-A learning without top-down modulation. We also demonstrate that it permits novel forms of context-dependent multilevel P-A mapping, applying the mechanism in the context of an intelligent driver assistance system.
@article{diva2:572598,
author = {Windridge, David and Felsberg, Michael and Shaukat, Affan},
title = {{A Framework for Hierarchical Perception--Action Learning Utilizing Fuzzy Reasoning}},
journal = {IEEE transactions on systems, man and cybernetics. Part B. Cybernetics},
year = {2013},
volume = {43},
number = {1},
pages = {155--169},
}
We propose a novel method for iterative learning of point correspondences between image sequences. Points moving on surfaces in 3D space are projected into two images. Given a point in either view, the considered problem is to determine the corresponding location in the other view. The geometry and distortions of the projections are unknown as is the shape of the surface. Given several pairs of point-sets but no access to the 3D scene, correspondence mappings can be found by excessive global optimization or by the fundamental matrix if a perspective projective model is assumed. However, an iterative solution on sequences of point-set pairs with general imaging geometry is preferable. We derive such a method that optimizes the mapping based on Neyman's chi-square divergence between the densities representing the uncertainties of the estimated and the actual locations. The densities are represented as channel vectors computed with a basis function approach. The mapping between these vectors is updated with each new pair of images such that fast convergence and high accuracy are achieved. The resulting algorithm runs in real-time and is superior to state-of-the-art methods in terms of convergence and accuracy in a number of experiments.
@article{diva2:539757,
author = {Felsberg, Michael and Larsson, Fredrik and Wiklund, Johan and Wadströmer, Niclas and Ahlberg, Jörgen},
title = {{Online Learning of Correspondences between Images}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2013},
volume = {35},
number = {1},
pages = {118--129},
}
This article presents a method for rectifying and stabilising video from cell-phones with rolling shutter (RS) cameras. Due to size constraints, cell-phone cameras have constant, or near constant focal length, making them an ideal application for calibrated projective geometry. In contrast to previous RS rectification attempts that model distortions in the image plane, we model the 3D rotation of the camera. We parameterise the camera rotation as a continuous curve, with knots distributed across a short frame interval. Curve parameters are found using non-linear least squares over inter-frame correspondences from a KLT tracker. By smoothing a sequence of reference rotations from the estimated curve, we can at a small extra cost, obtain a high-quality image stabilisation. Using synthetic RS sequences with associated ground-truth, we demonstrate that our rectification improves over two other methods. We also compare our video stabilisation with the methods in iMovie and Deshaker.
@article{diva2:505943,
author = {Ringaby, Erik and Forss\'{e}n, Per-Erik},
title = {{Efficient Video Rectification and Stabilisation for Cell-Phones}},
journal = {International Journal of Computer Vision},
year = {2012},
volume = {96},
number = {3},
pages = {335--352},
}
This work proposes an approach to tracking by regression that uses no hard-coded models and no offline learning stage. The Linear Predictor (LP) tracker has been shown to be highly computationally efficient, resulting in fast tracking. Regression tracking techniques tend to require offline learning to learn suitable regression functions. This work removes the need for offline learning and therefore increases the applicability of the technique. The online-LP tracker can simply be seeded with an initial target location, akin to the ubiquitous Lucas-Kanade algorithm that tracks by registering an image template via minimisation. A fundamental issue for all trackers is the representation of the target appearance and how this representation is able to adapt to changes in target appearance over time. The two proposed methods, LP-SMAT and LP-MED, demonstrate the ability to adapt to large appearance variations by incrementally building an appearance model that identifies modes or aspects of the target appearance and associates these aspects to the Linear Predictor trackers to which they are best suited. Experiments comparing and evaluating regression and registration techniques are presented along with performance evaluations favourably comparing the proposed tracker and appearance model learning methods to other state of the art simultaneous modelling and tracking approaches.
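A minimal sketch of the linear-predictor idea underlying the tracker, assuming the predictor is estimated by least squares from synthetic perturbations of the seed region (the paper's contribution is to learn and adapt such predictors online, without an offline stage):

import numpy as np
from scipy.ndimage import gaussian_filter

def learn_linear_predictor(image, center, support, n_train=300, max_disp=5, rng=None):
    """Learn P such that displacement is approximately P @ (I(support) - I(support + d))."""
    rng = np.random.default_rng() if rng is None else rng
    ref = image[center[1] + support[:, 1], center[0] + support[:, 0]].astype(float)
    D = np.empty((2, n_train))                 # synthetic displacements
    Y = np.empty((len(support), n_train))      # corresponding intensity differences
    for i in range(n_train):
        d = rng.integers(-max_disp, max_disp + 1, size=2)
        shifted = image[center[1] + d[1] + support[:, 1],
                        center[0] + d[0] + support[:, 0]].astype(float)
        D[:, i] = d
        Y[:, i] = ref - shifted
    P = D @ np.linalg.pinv(Y)                  # least-squares linear predictor
    return P, ref

def predict_displacement(P, ref, image, center, support):
    cur = image[center[1] + support[:, 1], center[0] + support[:, 0]].astype(float)
    return P @ (ref - cur)

# toy usage on a smoothed synthetic image
rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((200, 200)), 3.0)
support = rng.integers(-10, 11, size=(100, 2))     # sparse sampling pattern around the target
P, ref = learn_linear_predictor(img, center=(100, 100), support=support, rng=rng)
print(predict_displacement(P, ref, img, center=(102, 97), support=support))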
@article{diva2:444819,
author = {Ellis, Liam and Dowson, Nicholas and Matas, Jiri and Bowden, Richard},
title = {{Linear Regression and Adaptive Appearance Models for Fast Simultaneous Modelling and Tracking}},
journal = {International Journal of Computer Vision},
year = {2011},
volume = {95},
number = {2},
pages = {154--179},
}
Recently, the neural network based diagnosis of medical diseases has attracted a great deal of attention. In this paper a parallel feed-forward neural network structure is used in the prediction of Parkinson's Disease. The main idea of this paper is to use more than one neural network to reduce the possibility of an erroneous decision. The output of each neural network is evaluated by using a rule-based system for the final decision. Another important point in this paper is that during the training process, unlearned data of each neural network is collected and used in the training set of the next neural network. The designed parallel network system significantly increased the robustness of the prediction. A set of nine parallel neural networks yielded an improvement of 8.4% on the prediction of Parkinson's Disease compared to a single network. Furthermore, it is demonstrated that the designed system, to some extent, deals with the problems of imbalanced data sets.
@article{diva2:424219,
author = {Åström, Freddie and Koker, Rasit},
title = {{A parallel neural network approach to prediction of Parkinson´s Disease}},
journal = {Expert systems with applications},
year = {2011},
volume = {38},
number = {10},
pages = {12470--12474},
}
In this article we describe a set of canonical transformations of the image spaces that make the description of three-view geometry very simple. The transformations depend on the three-view geometry, and the canonically transformed trifocal tensor T' takes the form of a sparse array in which 17 elements in well-defined positions are zero. It has a linear relation to the camera matrices and to two of the fundamental matrices, a third-order relation to the third fundamental matrix, a second-order relation to the other two trifocal tensors, and first-order relations to the 10 three-view all-point matching constraints. In this canonical form, it is also simple to determine if the corresponding camera configuration is degenerate or co-linear. An important property of the three canonical transformations of the image spaces is that they are in SO(3). The 9 parameters needed to determine these transformations and the 9 parameters that determine the elements of T' together provide a minimal parameterization of the tensor. It does not have problems with multiple maps or multiple solutions that other parameterizations have, and is therefore simple to use. It also provides an implicit representation of the trifocal internal constraints: the sparse canonical representation of the trifocal tensor can be determined if and only if it is consistent with its internal constraints. In the non-ideal case, the canonical transformation can be determined by solving a minimization problem and a simple algorithm for determining the solution is provided. This allows us to extend the standard linear method for estimation of the trifocal tensor to include a constraint enforcement as a final step, similar to the constraint enforcement of the fundamental matrix.
Experimental evaluation of this extended linear estimation method shows that it significantly reduces the geometric error of the resulting tensor, but on average the algebraic estimation method is even better. For a small percentage of cases, however, the extended linear method gives a smaller geometric error, implying that it can be used as a complement to the algebraic method for these cases.
@article{diva2:409186,
author = {Nordberg, Klas},
title = {{The Key to Three-View Geometry}},
journal = {International Journal of Computer Vision},
year = {2011},
volume = {94},
number = {3},
pages = {282--294},
}
Fourier descriptors (FDs) are a classical but still popular method for contour matching. The key idea is to apply the Fourier transform to a periodic representation of the contour, which results in a shape descriptor in the frequency domain. FDs are most commonly used to compare object silhouettes and object contours; the authors instead use this well-established machinery to describe local regions to be used in an object-recognition framework. Many approaches to matching FDs are based on the magnitude of each FD component, thus ignoring the information contained in the phase. Keeping the phase information requires us to take into account the global rotation of the contour and shifting of the contour samples. The authors show that the sum-of-squared differences of FDs can be computed without explicitly de-rotating the contours. The authors compare correlation-based matching against affine-invariant Fourier descriptors (AFDs) and WARP-matched FDs and demonstrate that the correlation-based approach outperforms AFDs and WARP on real data. As a practical application, the authors demonstrate the proposed correlation-based matching on a road sign recognition task.
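A minimal sketch of correlation-based matching of Fourier descriptors, where the maximum magnitude of the circular cross-correlation of two descriptor vectors absorbs both the unknown starting point and the global rotation; the contour normalisation details are illustrative assumptions:

import numpy as np

def fourier_descriptors(contour):
    """contour : (N, 2) array of ordered boundary points of a closed contour."""
    z = contour[:, 0] + 1j * contour[:, 1]      # complex contour representation
    F = np.fft.fft(z)
    F[0] = 0.0                                  # remove translation
    norm = np.linalg.norm(F)
    return F / (norm if norm > 0 else 1.0)      # scale normalisation

def fd_similarity(A, B):
    """Maximum correlation of two FD vectors over start-point shifts; taking the
    magnitude also absorbs the unknown global rotation (a constant phase)."""
    corr = np.fft.ifft(A * np.conj(B)) * A.size   # circular cross-correlation
    return np.abs(corr).max()

# toy usage: a contour matched against a rotated, shifted-start copy of itself
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
c1 = np.stack([np.cos(t), 0.5 * np.sin(t)], axis=1)
R = np.array([[np.cos(0.7), -np.sin(0.7)], [np.sin(0.7), np.cos(0.7)]])
c2 = np.roll(c1 @ R.T, 11, axis=0)
print(fd_similarity(fourier_descriptors(c1), fourier_descriptors(c2)))   # close to 1
print(fd_similarity(fourier_descriptors(c1),
                    fourier_descriptors(np.random.default_rng(0).random((64, 2)))))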
@article{diva2:397235,
author = {Larsson, Fredrik and Felsberg, Michael and Forssen, Per-Erik},
title = {{Correlating Fourier descriptors of local patches for road sign recognition}},
journal = {IET Computer Vision},
year = {2011},
volume = {5},
number = {4},
pages = {244--254},
}
In this paper, we present a novel scheme for anisotropic diffusion driven by the image autocorrelation function. We show the equivalence of this scheme to a special case of iterated adaptive filtering. By determining the diffusion tensor field from an autocorrelation estimate, we obtain an evolution equation that is computed from a scalar product of the diffusion tensor and the image Hessian. We further propose a set of filters to approximate the Hessian on a minimized spatial support. On standard benchmarks, the resulting method performs favorably in many cases, in particular at low noise levels. In a GPU implementation, video real-time performance is easily achieved.
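A schematic sketch of one explicit update step of this type of tensor-driven diffusion: a structure-tensor (autocorrelation) estimate steers a diffusion tensor, and the update is the scalar product of that tensor with the image Hessian. The mapping from autocorrelation to diffusion tensor and the step size are illustrative choices, not the ones derived in the paper:

import numpy as np
from scipy.ndimage import gaussian_filter

def diffusion_step(u, dt=0.1, sigma=2.0, alpha=1e-3):
    """One explicit update u <- u + dt * <D, H>, with D steered by the local
    autocorrelation (structure tensor) estimate and H the image Hessian."""
    uy, ux = np.gradient(u)
    # structure tensor / local autocorrelation estimate
    Jxx = gaussian_filter(ux * ux, sigma)
    Jxy = gaussian_filter(ux * uy, sigma)
    Jyy = gaussian_filter(uy * uy, sigma)
    trace = Jxx + Jyy + alpha
    # illustrative diffusivity: diffuse less across strong structures
    Dxx = (Jyy + alpha) / trace
    Dyy = (Jxx + alpha) / trace
    Dxy = -Jxy / trace
    # image Hessian
    uxy, uxx = np.gradient(ux)
    uyy, _ = np.gradient(uy)
    return u + dt * (Dxx * uxx + 2.0 * Dxy * uxy + Dyy * uyy)

# toy usage: a few diffusion steps on a noisy smooth image
rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((128, 128)), 4) + 0.1 * rng.standard_normal((128, 128))
for _ in range(20):
    img = diffusion_step(img)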
@article{diva2:395634,
author = {Felsberg, Michael},
title = {{Autocorrelation-Driven Diffusion Filtering}},
journal = {IEEE Transactions on Image Processing},
year = {2011},
volume = {20},
number = {7},
pages = {1797--1806},
}
In this paper, we study object recognition in the embodied setting. More specifically, we study the problem of whether the recognition system will benefit from acquiring another observation of the object under study, or whether it is time to give up, and report the observed object as unknown. We describe the hardware and software of a system that implements recognition and object permanence as two nested perception-action cycles. We have collected three data sets of observation sequences that allow us to perform controlled evaluation of the system behavior. Our recognition system uses a KNN classifier with bag-of-features prototypes. For this classifier, we have designed and compared three different uncertainty measures for target observation. These measures allow the system to (a) decide whether to continue to observe an object or to move on, and to (b) decide whether the observed object is previously seen or novel. The system is able to successfully reject all novel objects as “unknown”, while still recognizing most of the previously seen objects.
@article{diva2:378735,
author = {Wallenberg, Marcus and Forss\'{e}n, Per-Erik},
title = {{Embodied Object Recognition using Adaptive Target Observations}},
journal = {Cognitive Computation},
year = {2010},
volume = {2},
number = {4},
pages = {316--325},
}
Recent years have seen advances in the estimation of full 6 degree-of-freedom object pose from a single 2D image. These advances have often been presented as a result of, or together with, a new local image feature type. This paper examines how the pose accuracy and recognition robustness for such a system varies with the choice of feature type. This is done by evaluating a full 6 degree-of-freedom pose estimation system for 17 different combinations of local descriptors and detectors. The evaluation is done on data sets with photos of challenging 3D objects with simple and complex backgrounds and varying illumination conditions. We examine the performance of the system under varying levels of object occlusion and we find that many features allow considerable object occlusion. From the experiments we can conclude that duplet features, which use pairs of interest points, improve pose estimation accuracy compared to single point features. Interestingly, we can also show that many features previously used for recognition and wide-baseline stereo are unsuitable for pose estimation, one notable example being the affine covariant features, which have proven quite successful in other applications. The data sets and their ground truths are available on the web to allow future comparison with novel algorithms.
@article{diva2:325003,
author = {Viksten, Fredrik and Forss\'{e}n, Per-Erik and Johansson, Björn and Moe, Anders},
title = {{Local Image Descriptors for Full 6 Degree-of-Freedom Object Pose Estimation and Recognition}},
journal = {},
year = {2010},
}
This article presents a computationally efficient approach to the triangulation of 3D points from their projections in two views. The homogeneous coordinates of a 3D point are given as a multi-linear mapping on its homogeneous image coordinates, a computation of low computational complexity. The multi-linear mapping is a tensor, and an element of a projective space, that can be computed directly from the camera matrices and some parameters. These parameters imply that the tensor is not unique: for a given camera pair the subspace K of triangulation tensors is six-dimensional. The triangulation tensor is 3D projective covariant and satisfies a set of internal constraints. Reconstruction of 3D points using the proposed tensor is studied for the non-ideal case, when the image coordinates are perturbed by noise and the epipolar constraint is not satisfied exactly. A particular tensor of K is then the optimal choice for a simple reduction of 3D errors, and we present a computationally efficient approach for determining this tensor. This approach implies that normalizing image coordinate transformations are important for obtaining small 3D errors.
In addition to computing the tensor from the cameras, we also investigate how it can be further optimized relative to error measures in the 3D and 2D spaces. This optimization is evaluated for sets of real 3D + 2D + 2D data by comparing the reconstruction to some of the triangulation methods found in the literature, in particular the so-called optimal method that minimizes 2D L2 errors. The general conclusion is that, depending on the choice of error measure and the optimization implementation, it is possible to find a tensor that produces smaller 3D errors (both L1 and L2) but slightly larger 2D errors than the optimal method does. Alternatively, we may find a tensor that gives approximately comparable results to the optimal method in terms of both 3D and 2D errors. This means that the proposed tensor based method of triangulation is both computationally efficient and can be calibrated to produce small reconstruction or reprojection errors for a given data set.
@article{diva2:271768,
author = {Nordberg, Klas},
title = {{The triangulation tensor}},
journal = {Computer Vision and Image Understanding},
year = {2009},
volume = {113},
number = {9},
pages = {935--945},
}
The major goal of the COSPAL project is to develop an artificial cognitive system architecture, with the ability to autonomously extend its capabilities. Exploratory learning is one strategy that allows an extension of competences as provided by the environment of the system. Whereas classical learning methods aim at best for a parametric generalization, i.e., concluding from a number of samples of a problem class to the problem class itself, exploration aims at applying acquired competences to a new problem class, and at applying generalization on a conceptual level, resulting in new models. Incremental or online learning is a crucial requirement to perform exploratory learning. In the COSPAL project, we mainly investigate reinforcement-type learning methods for exploratory learning, and in this paper we focus on the organization of cognitive systems for efficient operation. Learning is used over the entire system. It is organized in the form of four nested loops, where the outermost loop reflects the user-reinforcement-feedback loop, the intermediate two loops switch between different solution modes at the symbolic and sub-symbolic levels, respectively, and the innermost loop performs the acquired competences in terms of perception-action cycles. We present a system diagram which explains this process in more detail. We discuss the learning strategy in terms of learning scenarios provided by the user. This interaction between user (teacher) and system is a major difference to classical robotics systems, where the system designer places his world model into the system. We believe that this is the key to extendable robust system behavior and successful interaction of humans and artificial cognitive systems. We furthermore address the issue of bootstrapping the system, and, in particular, the visual recognition module. We give some more in-depth details about our recognition method and how feedback from higher levels is implemented. The described system is, however, work in progress and no final results are available yet. The preliminary results achieved so far clearly point towards a successful proof of the architecture concept.
@article{diva2:240894,
author = {Felsberg, Michael and Wiklund, Johan and Granlund, Gösta},
title = {{Exploratory learning structures in artificial cognitive systems}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {11},
pages = {1671--1687},
}
Channel-coded feature maps (CCFMs) represent arbitrary image features using multi-dimensional histograms with soft and overlapping bins. This representation can be seen as a generalization of the SIFT descriptor, where one advantage is that it is better suited for computing derivatives with respect to image transformations. Using these derivatives, a local optimization of image scale, rotation and position relative to a reference view can be computed. If piecewise polynomial bin functions are used, e.g. B-splines, these histograms can be computed by first encoding the data set into a histogram-like representation with non-overlapping multi-dimensional monomials as bin functions. This representation can then be processed using multi-dimensional convolutions to obtain the desired representation. This allows much of the computation for the derivatives to be reused. By comparing the complexity of this method to direct encoding, it is found that the piecewise method is preferable for large images and smaller patches with few channels, which makes it useful, e.g., in early steps of coarse-to-fine approaches.
@article{diva2:240892,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Efficient computation of channel-coded feature maps through piecewise polynomials}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {11},
pages = {1688--1694},
}
@article{diva2:240891,
author = {Granlund, Gösta},
title = {{Special issue on Perception, Action and Learning}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {11},
pages = {1639--1640},
}
In this paper, we present a visual servoing method based on a learned mapping between feature space and control space. Using a suitable recognition algorithm, we present and evaluate a complete method that simultaneously learns the appearance and control of a low-cost robotic arm. The recognition part is trained using an action precedes perception approach. The novelty of this paper, apart from the visual servoing method per se, is the combination of visual servoing with gripper recognition. We show that we can achieve high precision positioning without knowing in advance what the robotic arm looks like or how it is controlled.
@article{diva2:240889,
author = {Larsson, Fredrik and Jonsson, Erik and Felsberg, Michael},
title = {{Simultaneously learning to recognize and control a low-cost robotic arm}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {11},
pages = {1729--1739},
}
This paper presents a method that combines shadow detection and a 3D box model including shadow simulation for estimation of the size and position of vehicles. We define a similarity measure between a simulated image of a 3D box, including the box shadow, and a captured image that is classified into background/foreground/shadow. The similarity measure is used in an optimization procedure to find the optimal box state. It is shown in a number of experiments and examples how the combination of shadow detection and simulation improves the estimation compared to just using detection or simulation, especially when the shadow detection or the simulation is inaccurate. We also describe a tracking system that utilizes the estimated 3D boxes, including highlight detection, a spatial window instead of a time-based window for predicting heading, and refined box size estimates obtained by weighting accumulated estimates depending on view. Finally, we show example results.
@article{diva2:224936,
author = {Johansson, Björn and Wiklund, Johan and Forss\'{e}n, Per-Erik and Granlund, Gösta},
title = {{Combining shadow detection and simulation for estimation of vehicle size and position}},
journal = {Pattern Recognition Letters},
year = {2009},
volume = {30},
number = {8},
pages = {751--759},
}
Intrinsic dimensionality is a concept introduced by statistics and later used in image processing to measure the dimensionality of a data set. In this paper, we introduce a continuous representation of the intrinsic dimension of an image patch in terms of its local spectrum or, equivalently, its gradient field. By making use of a cone structure and barycentric co-ordinates, we can associate three confidences to the three different ideal cases of intrinsic dimensions corresponding to homogeneous image patches, edge-like structures and junctions. The main novelty of our approach is the representation of confidences as prior probabilities which can be used within a probabilistic framework. To show the potential of our continuous representation, we highlight applications in various contexts such as image structure classification, feature detection and localisation, visual scene statistics and optic flow evaluation.
@article{diva2:214515,
author = {Felsberg, Michael and Kalkan, Sinan and Krüger, Norbert},
title = {{Continuous dimensionality characterization of image structures}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {6},
pages = {628--636},
}
This article introduces a new region based feature for object recognition and image matching. In contrast to many other region based features, this one makes use of colour in the feature extraction stage. We perform experiments on the repeatability rate of the features across scale and inclination angle changes, and show that avoiding merging regions connected by only a few pixels improves the repeatability. We introduce two voting schemes that allow us to find correspondences automatically, and compare them with respect to the number of valid correspondences they give, and their inlier ratios. We also demonstrate how the matching procedure can be applied to colour correction.
@article{diva2:133538,
author = {Forssen, Per-Erik and Moe, Anders},
title = {{View matching with blob features}},
journal = {Image and Vision Computing},
year = {2009},
volume = {27},
number = {1-2},
pages = {99--107},
}
@article{diva2:265735,
author = {Felsberg, Michael},
title = {{COSPAL -- A Study on Artificial Cognitive Systems}},
journal = {Engineering \& technology},
year = {2008},
volume = {3},
number = {18},
}
Inspired by the early visual system of many mammalians we consider the construction of, and reconstruction from, an orientation score U_f : R^2 × S^1 → C as a local orientation representation of an image f : R^2 → R. The mapping f ↦ U_f is a wavelet transform W_ψ corresponding to a reducible representation of the Euclidean motion group onto L^2(R^2) and an oriented wavelet ψ ∈ L^2(R^2). This wavelet transform is a special case of a recently developed generalization of the standard wavelet theory and has the practical advantage over the usual wavelet approaches in image analysis (constructed by irreducible representations of the similitude group) that it allows a stable reconstruction from one (single scale) orientation score. Since our wavelet transform is a unitary mapping with stable inverse, we directly relate operations on orientation scores to operations on images in a robust manner.
Furthermore, by geometrical examination of the Euclidean motion group G = R^2 ⋊ T, which is the domain of our orientation scores, we deduce that an operator Φ on orientation scores must be left invariant to ensure that the corresponding operator W_ψ^{-1} Φ W_ψ on images is Euclidean invariant. As an example we consider all linear second order left invariant evolutions on orientation scores corresponding to stochastic processes on G. As an application we detect elongated structures in (medical) images and automatically close the gaps between them.
Finally, we consider robust orientation estimates by means of channel representations, where we combine robust orientation estimation and learning of wavelets resulting in an auto-associative processing of orientation features. Here linear averaging of the channel representation is equivalent to robust orientation estimation and an adaptation of the wavelet to the statistics of the considered image class leads to an auto-associative behavior of the system.
@article{diva2:262428,
author = {Duits, Remco and Felsberg, Michael and Granlund, Gösta and ter Haar Romeny, Bart M.},
title = {{Image Analysis and Reconstruction using a Wavelet Transform Constructed from a Reducible Representation of the Euclidean Motion Group}},
journal = {International Journal of Computer Vision},
year = {2007},
volume = {72},
number = {1},
pages = {79--102},
}
In this paper we propose a new approach to real-time view-based pose recognition and interpolation. Pose recognition is particularly useful for identifying camera views in databases, video sequences, video streams, and live recordings. All of these applications require a fast pose recognition process, in many cases video real-time. It should further be possible to extend the database with new material, i.e., to update the recognition system online. The method that we propose is based on P-channels, a special kind of information representation which combines advantages of histograms and local linear models. Our approach is motivated by its similarity to information representation in biological systems but its main advantage is its robustness against common distortions such as clutter and occlusion. The recognition algorithm consists of three steps: (1) low-level image features for color and local orientation are extracted in each point of the image; (2) these features are encoded into P-channels by combining similar features within local image regions; (3) the query P-channels are compared to a set of prototype P-channels in a database using a least-squares approach. The algorithm is applied in two scene registration experiments with fisheye camera data, one for pose interpolation from synthetic images and one for finding the nearest view in a set of real images. The method compares favorably to SIFT-based methods, in particular concerning interpolation. The method can be used for initializing pose-tracking systems, either when starting the tracking or when the tracking has failed and the system needs to re-initialize. Due to its real-time performance, the method can also be embedded directly into the tracking system, allowing a sensor fusion unit to choose dynamically between the frame-by-frame tracking and the pose recognition.
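A minimal sketch of step (3), the least-squares comparison of a query descriptor against prototype descriptors followed by pose interpolation; the toy descriptor and the unconstrained least-squares solve are illustrative assumptions (the actual P-channel construction is more involved):

import numpy as np

def descriptor(a):
    """Toy stand-in for a P-channel descriptor: a soft histogram that varies
    smoothly with a one-dimensional pose parameter a (illustrative only)."""
    centers = np.linspace(0.0, np.pi, 20)
    return np.exp(-((centers - a) ** 2) / 0.05)

def interpolate_pose(query, prototypes, poses):
    """Least-squares comparison of a query descriptor against prototype
    descriptors (columns of `prototypes`), followed by pose interpolation."""
    w, *_ = np.linalg.lstsq(prototypes, query, rcond=None)
    w = w / w.sum()                      # normalise the weights for interpolation
    return poses.T @ w

# toy usage: a small database of prototype views with known 1D poses
angles = np.linspace(0.0, np.pi, 8)
prototypes = np.stack([descriptor(a) for a in angles], axis=1)
poses = angles[:, None]
print(interpolate_pose(descriptor(0.9), prototypes, poses))   # close to 0.9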
@article{diva2:260354,
author = {Felsberg, Michael and Hedborg, Johan},
title = {{Real-Time View-Based Pose Recognition and Interpolation for Tracking Initialization}},
journal = {Journal of Real-Time Image Processing},
year = {2007},
volume = {2},
number = {2-3},
pages = {103--115},
}
Augmented reality is a growing field, with many diverse applications ranging from TV and film production, to industrial maintenance, medicine, education, entertainment and games. The central idea is to add virtual objects into a real scene, either by displaying them in a see-through head-mounted display, or by superimposing them on an image of the scene captured by a camera. Depending on the application, the added objects might be virtual characters in a TV or film production, instructions for repairing a car engine, or a reconstruction of an archaeological site. For the effect to be believable, the virtual objects must appear rigidly fixed to the real world, which requires the accurate real-time measurement of the position of the camera or the user's head. Present technology cannot achieve this without resorting to systems that require a significant infrastructure in the operating environment, severely restricting the range of possible applications.
@article{diva2:260355,
author = {Felsberg, Michael and Koch, Reinhard},
title = {{Editorial for the special issue on markerless real-time tracking for augmented reality image synthesis}},
journal = {Journal of Real-Time Image Processing},
year = {2007},
volume = {2},
number = {2-3},
pages = {67--68},
}
In order to insert a virtual object into a TV image, the graphics system needs to know precisely how the camera is moving, so that the virtual object can be rendered in the correct place in every frame. Nowadays this can be achieved relatively easily in post-production, or in a studio equipped with a special tracking system. However, for live shooting on location, or in a studio that is not specially equipped, installing such a system can be difficult or uneconomic. To overcome these limitations, the MATRIS project is developing a real-time system for measuring the movement of a camera. The system uses image analysis to track naturally occurring features in the scene, and data from an inertial sensor. No additional sensors, special markers, or camera mounts are required. This paper gives an overview of the system and presents some results.
@article{diva2:259717,
author = {Chandaria, Jigna and Thomas, Graham and Bartczak, Bogumil and Koch, Reinhard and Becker, Mario and Bleser, Gabriele and Stricker, Didier and Wohlleber, Cedric and Gustafsson, Fredrik and Felsberg, Michael and Hol, Jeroen and Schön, Thomas and Skoglund, Johan and Slycke, Per and Smeitz, Sebastiaan},
title = {{Real-Time Camera Tracking in the MATRIS Project}},
journal = {Smpte Journal},
year = {2007},
volume = {116},
number = {7-8},
pages = {266--271},
}
This paper brings together a novel information representation model for use in signal processing and computer vision problems, with a particular algorithmic development of the Landweber iterative algorithm. The information representation model allows a representation of multiple values for a variable as well as an expression for confidence. Both properties are important for effective computation using multi-level models, where a choice between models will be implementable as part of the optimization process. It is shown that in this way the algorithm can deal with a class of high-dimensional, sparse, and constrained least-squares problems, which arise in various computer vision learning tasks, such as object recognition and object pose estimation. While the algorithm has been applied to the solution of such problems, it has so far been used heuristically. In this paper we describe the properties and some of the peculiarities of the channel representation and optimization, and put them on firm mathematical ground. We consider the optimization a convexly constrained weighted least-squares problem and propose for its solution a projected Landweber method which employs oblique projections onto the closed convex constraint set. We formulate the problem, present the algorithm and work out its convergence properties, including a rate-of-convergence result. The results are put in perspective with currently available projected Landweber methods. An application to supervised learning is described, and the method is evaluated in an experiment involving function approximation, as well as application to transient signals.
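A minimal sketch of a projected Landweber iteration for a constrained least-squares problem min ||Ax - b||^2 with x restricted to a convex set C; here C is simply the non-negative orthant with an orthogonal projection, whereas the paper employs oblique projections onto a more general closed convex constraint set:

import numpy as np

def projected_landweber(A, b, n_iter=500, omega=None):
    """Projected Landweber iteration x <- P_C(x + omega * A^T (b - A x)),
    with P_C the (orthogonal) projection onto the non-negative orthant."""
    if omega is None:
        omega = 1.0 / np.linalg.norm(A, 2) ** 2    # step size below 2 / ||A||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x + omega * A.T @ (b - A @ x)          # Landweber (gradient) step
        x = np.maximum(x, 0.0)                     # projection onto the constraint set
    return x

# toy usage: non-negative least squares
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 40))
x_true = np.maximum(rng.standard_normal(40), 0.0)
b = A @ x_true + 0.01 * rng.standard_normal(80)
print(np.linalg.norm(projected_landweber(A, b) - x_true))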
@article{diva2:271150,
author = {Johansson, Björn and Elfving, Tommy and Kozlov, Vladimir and Censor, Y. and Forss\'{e}n, Per-Erik and Granlund, Gösta},
title = {{The application of an oblique-projected Landweber method to a model of supervised learning}},
journal = {Mathematical and computer modelling},
year = {2006},
volume = {43},
number = {7-8},
pages = {892--909},
}
In this paper, we present a new and efficient method to implement robust smoothing of low-level signal features: B-spline channel smoothing. This method consists of three steps: encoding of the signal features into channels, averaging of the channels, and decoding of the channels. We show that linear smoothing of channels is equivalent to robust smoothing of the signal features if we make use of quadratic B-splines to generate the channels. The linear decoding from B-spline channels allows the derivation of a robust error norm, which is very similar to Tukey's biweight error norm. We compare channel smoothing with three other robust smoothing techniques: nonlinear diffusion, bilateral filtering, and mean-shift filtering, both theoretically and on a 2D orientation-data smoothing task. Channel smoothing is found to be superior in four respects: It has a lower computational complexity, it is easy to implement, it chooses the global minimum error instead of the nearest local minimum, and it can also be used on nonlinear spaces, such as orientation space.
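A minimal sketch of the three steps (encode, average, decode) for a 1D signal using quadratic B-spline channels; the decoding below is a simplified weighted average over the channels around the strongest response rather than the exact linear decoding derived in the paper:

import numpy as np

def b2(t):
    """Quadratic B-spline basis function (support [-1.5, 1.5])."""
    t = np.abs(t)
    return np.where(t < 0.5, 0.75 - t ** 2,
                    np.where(t < 1.5, 0.5 * (1.5 - t) ** 2, 0.0))

def encode(x, centers):
    """Channel encoding: one soft, overlapping B-spline channel per center."""
    h = centers[1] - centers[0]                         # channel spacing
    return b2((x[:, None] - centers[None, :]) / h)      # (N, K) channel matrix

def decode(c, centers):
    """Simplified decoding: weighted mean of the three channels around the
    strongest response (the paper derives an exact linear decoding)."""
    k = np.clip(np.argmax(c, axis=1), 1, len(centers) - 2)
    idx = k[:, None] + np.array([-1, 0, 1])
    w = np.take_along_axis(c, idx, axis=1)
    return (w * centers[idx]).sum(axis=1) / w.sum(axis=1)

# robust smoothing of a noisy step signal: encode, average channels locally, decode
rng = np.random.default_rng(0)
x = np.concatenate([np.zeros(100), np.ones(100)]) + 0.1 * rng.standard_normal(200)
centers = np.linspace(-0.5, 1.5, 9)
C = encode(x, centers)
kernel = np.ones(15) / 15.0
C_avg = np.stack([np.convolve(C[:, k], kernel, mode='same') for k in range(C.shape[1])], axis=1)
smoothed = decode(C_avg, centers)
print(smoothed[:5], smoothed[-5:])    # close to 0 and close to 1, respectively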
@article{diva2:270945,
author = {Felsberg, Michael and Forssen, P.-E. and Scharr, H.},
title = {{Channel smoothing:
Efficient robust smoothing of low-level signal features}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2006},
volume = {28},
number = {2},
pages = {209--222},
}
A runtime system for implementation of image processing operations is presented. It is designed for working in a flexible and distributed environment related to the software architecture of a newly developed UAV system. The software architecture can be characterized at a coarse scale as a layered system, with a deliberative layer at the top, a reactive layer in the middle, and a processing layer at the bottom. At a finer scale each of the three levels is decomposed into sets of modules which communicate using CORBA, allowing system development and deployment on the UAV to be made in a highly flexible way. Image processing takes place in a dedicated module located in the process layer, and is the main focus of the paper. This module has been designed as a runtime system for data flow graphs, allowing various processing operations to be created online and on demand by the higher levels of the system. The runtime system is implemented in Java, which allows development and deployment to be made on a wide range of hardware/software configurations. Optimizations for particular hardware platforms have been made using Java's native interface.
@article{diva2:268968,
author = {Nordberg, Klas and Doherty, Patrick and Forss\'{e}n, Per-Erik and Wiklund, Johan and Andersson, Per},
title = {{A flexible runtime system for image processing in a distributed computational environment for an unmanned aerial vehicle}},
journal = {International Journal of Pattern Recognition and Artificial Intelligence},
year = {2006},
volume = {20},
number = {5},
pages = {763--780},
}
This paper describes a method for vision-based unmanned aerial vehicle (UAV) motion estimation from multiple planar homographies. The paper also describes the determination of the relative displacement between different UAVs employing techniques for blob feature extraction and matching. It then presents and shows experimental results of the application of the proposed technique to multi-UAV detection of forest fires.
@article{diva2:262438,
author = {Merino, Luis and Wiklund, Johan and Caballero, Fernando and Moe, Anders and Martinez-de Dios, Jose Ramiro and Forss\'{e}n, Per-Erik and Nordberg, Klas and Ollero, Annibal},
title = {{Vision-Based Multi-UAV Position Estimation}},
journal = {IEEE Robotics \& Automation Magazine},
year = {2006},
volume = {13},
number = {3},
pages = {53--62},
}
In medical helical cone-beam CT, it is common that the region-of-interest (ROI) is contained inside the helix cylinder, while the complete object is long and extends outside the top and the bottom of the cylinder. This is the Long Object Problem. Analytical reconstruction methods for helical cone-beam CT have been designed to handle this problem. It has been shown that a moderate amount of over-scanning is sufficient for reconstruction of a certain ROI. The over-scanning projection rays travel both through the ROI and outside it. This is unfortunate for iterative methods since it seems impossible to compute accurate values for the projection rays which travel partly inside and partly outside the ROI. Therefore, it seems that the useful ROI will diminish with every iteration step. We propose the following solution to the problem. Firstly, we reconstruct volume regions also outside the ROI. These volume regions will certainly be incompletely reconstructed, but our experimental results show that they serve well for projection generation. This is rather counter-intuitive and contradictory to our initial assumptions. Secondly, we use careful extrapolation and masking of projection data. This is not a general necessity, but needed for the chosen iterative algorithm, which includes rebinning and iterative filtered backprojection. Our idea here was to use an approximate reconstruction method which gives cone-beam artifacts and then improve the reconstructed result by iterative filtered backprojection. The experimental results seem very encouraging. The cone-beam artifacts can indeed be removed. Even voxels close to the boundary of the ROI are enhanced by the iterative loop as well as those in the middle of the ROI.
@article{diva2:258148,
author = {Magnusson, Maria and Danielsson, Per-Erik and Sunnegårdh, Johan},
title = {{Handling of Long Objects in Iterative Improvement of Non-Exact Reconstruction in Helical Cone-Beam CT}},
journal = {IEEE Transactions on Medical Imaging},
year = {2006},
volume = {25},
number = {7},
pages = {935--940},
}
A fundamental property of cognitive vision systems is that they shall be extendable, which requires that they can both acquire and store information autonomously. The paper discusses the organization of systems to allow this, and proposes an architecture for cognitive vision systems. The architecture consists of two parts. The first part learns, step by step, a mapping from percepts directly onto actions or states. In the learning phase, action precedes perception, as action space is much less complex. This requires a semantic information representation, allowing computation and storage with respect to similarity. The second part uses invariant or symbolic representations, which are derived mainly from system and action states. Through active exploration, a system builds up concept spaces or models. This allows the system to subsequently acquire information using passive observation or language. The structure has been used to learn object properties, and constitutes the basic concepts for the European project COSPAL, within the IST programme.
@article{diva2:258030,
author = {Granlund, Gösta},
title = {{A Cognitive Vision Architecture Integrating Neural Networks with Symbolic Processing}},
journal = {Künstliche Intelligenz},
year = {2006},
number = {2},
pages = {18--24},
}
Tensors have become a popular tool for representation of local orientation and can also be used for estimation of velocity. A number of computational approaches have been presented for tensor estimation which, however, are difficult to analyze or compare since there has been no common framework in which analysis or comparisons can be made. In this article, we propose such a framework based on second-order filters and show how it applies to three different methods for tensor estimation. The framework contains a few conditions on the filters which are sufficient to obtain correctly oriented rank one tensors for the case of simple signals. It also allows the derivation of explicit expressions for the variation of the tensor across oriented structures which, e.g., can be used to formulate conditions for phase invariance.
@article{diva2:266999,
author = {Nordberg, Klas and Farnebäck, Gunnar},
title = {{Estimation of orientation tensors for simple signals by means of second-order filters}},
journal = {Signal Processing: Image Communication},
year = {2005},
volume = {20},
number = {6},
pages = {582--594},
}
The management of environmental and industrial disasters, search and rescue operations, surveillance of natural scenarios, environmental monitoring, and many other field robotics applications require high mobility and the need to reach locations that are difficult to access with ground vehicles. In many cases, the use of aerial vehicles is the best way to approach the objective to get information or to deploy instrumentation. Unmanned air vehicles (UAVs) have significantly increased their flight performance and autonomous onboard processing capabilities in the last ten years. But a single aerial vehicle equipped with a large array of different sensors of various modalities is limited at any time to a single viewpoint. A team of aerial vehicles, however, can simultaneously collect information from multiple locations and exploit the information derived from multiple disparate points. Furthermore, having a team with multiple heterogeneous aerial vehicles offers additional advantages due to the possibility of beneficial complementarities of the vehicles.
@article{diva2:258040,
author = {Ollero, Anibal and Lacroix, Simon and Merino, Luis and Gancet, Jeremi and Wiklund, Johan and Remuß, Volker and Perez, Iker Veiga and Guti\'{e}rrez, Luis G. and Viegas, Domingos Xavier and Benitez, Miguel Angel González and Mallet, Anthony and Alami, Rachid and Chatila, Raja and Hommel, Günter and Lechuga, F. J. Colmenero and Arrue, Begoña C. and Ferruz, Joaquin and Martinez-de Dios, Jose Ramiro and Caballero, Fernando},
title = {{Multiple Eyes in the Skies}},
journal = {IEEE robotics \& automation magazine},
year = {2005},
volume = {12},
number = {2},
pages = {46--57},
}
In this paper we present a novel method to implement the monogenic scale space on a rectangular domain. The monogenic scale space is a vector valued scale space based on the Poisson scale space, which establishes a sophisticated alternative to the Gaussian scale space. Previous implementations of the monogenic scale space are Fourier transform based, and therefore suffer from the implicit periodicity in case of finite domains. The features of the monogenic scale space, including local amplitude, local phase, local orientation, local frequency, and phase congruency, are much easier to interpret in terms of image features evolving through scale than in the Gaussian case. Furthermore, applying results from harmonic analysis, relations between the features are obtained which improve the understanding of image analysis. As applications, we present a very simple but still accurate approach to image reconstruction from local amplitude and local phase and a method for extracting the evolution of lines and edges through scale.
@article{diva2:244285,
author = {Felsberg, Michael and Duits, R. and Florack, L.},
title = {{The Monogenic Scale Space on a Rectangular Domain and its Features}},
journal = {International Journal of Computer Vision},
year = {2005},
volume = {64},
number = {2--3},
}
The effect of scatter on reconstructed image quality in cone beam computed tomography was investigated and a function which can be used in scatter-reduction optimisation tasks was tested. Projections were calculated using the Monte Carlo method in an axially symmetric cone beam geometry consisting of a point source, water phantom and a single row of detector elements. Image reconstruction was performed using the filtered backprojection method. Image quality was assessed by the L2-norm-based difference relative to a reference image derived from (1) weighted linear attenuation coefficients and (2) projections by primary photons. It was found that the former function was strongly affected by the beam hardening artefact and did not properly reflect the amount of scatter, but the latter function increased with increasing beam width, was higher for the larger phantom and exhibited properties which made it a good candidate for scatter-reduction optimisation tasks using polyenergetic beams.
@article{diva2:17697,
author = {Malusek, Alexandr and Magnusson Seger, Maria and Sandborg, Michael and Alm Carlsson, Gudrun},
title = {{Effect of scatter on reconstructed image quality in cone beam CT:
evaluation of a scatter-reduction optimization function}},
journal = {Radiation Protection Dosimetry},
year = {2005},
volume = {114},
number = {1-3},
pages = {337--340},
}
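As a concrete reading of the L2-norm-based figure of merit used in the study above, the sketch below (Python/NumPy; the function name and the optional region-of-interest argument are our own assumptions, not taken from the paper) computes the norm of the difference between a reconstructed image and a reference image, relative to the norm of the reference:

    import numpy as np

    def relative_l2_difference(reconstruction, reference, roi=None):
        # L2 norm of the error relative to the L2 norm of the reference image
        rec = np.asarray(reconstruction, dtype=float)
        ref = np.asarray(reference, dtype=float)
        if roi is not None:
            # optionally restrict the comparison to a boolean region-of-interest mask
            rec, ref = rec[roi], ref[roi]
        return np.linalg.norm(rec - ref) / np.linalg.norm(ref)

With a reference derived from projections by primary photons, this quantity grows with beam width and phantom size, consistent with the behaviour reported above, which is what makes it a candidate objective for scatter-reduction optimisation.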
We introduce a compact coding of image information in terms of local multi-modal image descriptors. This coding allows for an explicit separation of the local image information into different visual sub-modalities: geometric information (orientation) and structural image information (contrast transition and colour). Based on this image representation, we derive a similarity function that compares visual information in each of these sub-modalities. This allows for an investigation of the importance of the different factors for stereo matching on a large data set. From this investigation we conclude that it is the combination of visual modalities that gives the best results. Concrete weights for their relative importance are measured. In addition to these quantitative results, we can demonstrate by our simulations that although our image representation reduces image information by 97% we achieve a matching performance which is comparable to block matching techniques. This shows that our very condensed representation preserves the relevant visual information.
@article{diva2:266619,
author = {Kruger, N. and Felsberg, Michael},
title = {{An explicit and compact coding of geometric and structural image information applied to stereo processing}},
journal = {Pattern Recognition Letters},
year = {2004},
volume = {25},
number = {8},
pages = {849--863},
}
In this paper we address the topics of scale-space and phase-based image processing in a unifying framework. In contrast to the common opinion, the Gaussian kernel is not the unique choice for a linear scale-space. Instead, we chose the Poisson kernel since it is closely related to the monogenic signal, a 2D generalization of the analytic signal, where the Riesz transform replaces the Hilbert transform. The Riesz transform itself yields the flux of the Poisson scale-space and the combination of flux and scale-space, the monogenic scale-space, provides the local features phase-vector and attenuation in scale-space. Under certain assumptions, the latter two again form a monogenic scale-space which gives deeper insight to low-level image processing. In particular, we discuss edge detection by a new approach to phase congruency and its relation to amplitude based methods, reconstruction from local amplitude and local phase, and the evaluation of the local frequency.
@article{diva2:262433,
author = {Felsberg, Michael and Sommer, Gerald},
title = {{The Monogenic Scale-Space:
A Unifying Approach to Phase-Based Image Processing in Scale-Space}},
journal = {Journal of Mathematical Imaging and Vision},
year = {2004},
volume = {21},
}
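For reference, the 2D Poisson kernel and its conjugate (Riesz) counterpart, which generate the Poisson scale-space and its flux discussed above, can be written as follows (standard textbook forms stated with the angular-frequency convention; they are not quoted from the paper):

    \[
    p_s(x, y) = \frac{1}{2\pi}\,\frac{s}{\left(x^2 + y^2 + s^2\right)^{3/2}},
    \qquad
    \hat{p}_s(\boldsymbol{\omega}) = e^{-s\,\lvert\boldsymbol{\omega}\rvert},
    \]
    \[
    \mathbf{q}_s(x, y) = \frac{1}{2\pi}\,\frac{(x, y)}{\left(x^2 + y^2 + s^2\right)^{3/2}},
    \qquad
    \hat{\mathbf{q}}_s(\boldsymbol{\omega}) = -i\,\frac{\boldsymbol{\omega}}{\lvert\boldsymbol{\omega}\rvert}\, e^{-s\,\lvert\boldsymbol{\omega}\rvert}.
    \]

Convolving an image with p_s and q_s for increasing s yields the scale-space and its flux, from which the local features (amplitude, phase, orientation) are derived.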
A method for unrestricted recognition of 3-D objects has been developed. By unrestricted, we imply that the recognition shall be done independently of object position, scale, orientation and pose, against a structured background. It shall not assume any preceding segmentation and allow a reasonable degree of occlusion. The method uses a hierarchy of triplet feature invariants, which are at each level defined by a learning procedure. In the feed-back learning procedure, percepts are mapped upon system states corresponding to manipulation parameters of the object. The method uses a learning architecture employing channel information representation. The paper contains a discussion of how objects can be represented. A structure is proposed to deal with object and contextual properties in a transparent manner.
@article{diva2:258146,
author = {Granlund, Gösta and Moe, Anders},
title = {{Unrestricted Recognition of 3-D Objects for Robotics Using Multi-Level Triplet Invariants}},
journal = {Artificial Intelligence Magazine},
year = {2004},
volume = {25},
number = {2},
pages = {51--67},
}
This paper introduces a two-dimensional generalization of the analytic signal. This novel approach is based on the Riesz transform, which is used instead of the Hilbert transform. The combination of a 2D signal with the Riesz transformed one yields a sophisticated 2D analytic signal, the monogenic signal. The approach is derived analytically from irrotational and solenoidal vector fields. Based on local amplitude and local phase, an appropriate local signal representation is presented which preserves the split of identity, i.e., the invariance–equivariance property of signal decomposition. This is one of the central properties of the 1D analytic signal that decomposes a signal into structural and energetic information. We show that further properties of the analytic signal concerning symmetry, energy, allpass transfer function, and orthogonality are also preserved, and we compare this to the behavior of other approaches for a 2D analytic signal. As a central topic of this paper, a geometric phase interpretation is introduced which is based on the relation between the 1D analytic signal and the 2D monogenic signal established by the Radon transform. Possible applications of this relationship are sketched and references to other applications of the monogenic signal are given. This report is a revised version of the technical report 2009 [7], and therefore supersedes it.
@article{diva2:241550,
author = {Felsberg, Michael and Sommer, Gerald},
title = {{The monogenic signal}},
journal = {IEEE Transactions on Signal Processing},
year = {2001},
volume = {49},
number = {12},
pages = {3136--3144},
}
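A minimal FFT-based sketch of the monogenic signal (Python/NumPy; all function and variable names are ours) computes the two Riesz components through the frequency response -i u/|u| and derives local amplitude, phase and orientation; such an FFT implementation implicitly assumes a periodic image, which is exactly the limitation addressed by the rectangular-domain paper listed further above:

    import numpy as np

    def monogenic_signal(f):
        # Riesz transform via its frequency response -i * u / |u|
        rows, cols = f.shape
        u = np.fft.fftfreq(cols)
        v = np.fft.fftfreq(rows)
        U, V = np.meshgrid(u, v)
        R = np.hypot(U, V)
        R[0, 0] = 1.0                      # avoid division by zero at DC
        F = np.fft.fft2(f)
        r1 = np.real(np.fft.ifft2(F * (-1j * U / R)))
        r2 = np.real(np.fft.ifft2(F * (-1j * V / R)))
        amplitude = np.sqrt(f**2 + r1**2 + r2**2)
        phase = np.arctan2(np.hypot(r1, r2), f)   # local phase
        orientation = np.arctan2(r2, r1)          # local orientation
        return amplitude, phase, orientation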
Estimation of local second-degree variation should be a natural first step in computerized image analysis, just as it seems to be in human vision. A prevailing obstacle is that the second derivatives entangle the three features signal strength (i.e. magnitude or energy), orientation and shape. To disentangle these features we propose a technique where the orientation of an arbitrary pattern f is identified with the rotation required to align the pattern with its prototype p. This is more strictly formulated as solving the derotating equation. The set of all possible prototypes spans the shape-space of second degree variation. This space is one-dimensional for 2D-images, two-dimensional for 3D-images. The derotation decreases the original dimensionality of the response vector from three to two in the 2D-case and from six to three in the 3D-case, in both cases leaving room only for magnitude and shape in the prototype. The solution to the derotation and a full understanding of the result requires i) mapping the derivatives of the pattern f onto the orthonormal basis of spherical harmonics, and ii) identifying the eigenvalues of the Hessian with the derivatives of the prototype p. But once the shape-space is established, the possibilities to put together independent discriminators for magnitude, orientation, and shape are almost limitless.
@article{diva2:241538,
author = {Danielsson, Per-Erik and Lin, Qingfen and Ye, Qin-Zhong},
title = {{Efficient detection of second-degree variations in 2D and 3D images}},
journal = {Journal of Visual Communication and Image Representation},
year = {2001},
volume = {12},
number = {3},
pages = {255--305},
}
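The following rough sketch (Python/SciPy; not the exact algorithm of the paper, and all names and the shape coordinate are our own choices) illustrates how the three entangled features can be separated in the 2D case from Gaussian-smoothed second derivatives, with the orientation obtained as the derotation angle of the Hessian:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def second_degree_features(img, sigma=2.0):
        # second derivatives at scale sigma (x = columns, y = rows)
        fxx = gaussian_filter(img, sigma, order=(0, 2))
        fyy = gaussian_filter(img, sigma, order=(2, 0))
        fxy = gaussian_filter(img, sigma, order=(1, 1))
        # eigenvalues of the Hessian [[fxx, fxy], [fxy, fyy]]
        mean = 0.5 * (fxx + fyy)
        root = np.sqrt((0.5 * (fxx - fyy))**2 + fxy**2)
        lam1, lam2 = mean + root, mean - root
        # order so that |lam1| >= |lam2|
        swap = np.abs(lam2) > np.abs(lam1)
        lam1, lam2 = np.where(swap, lam2, lam1), np.where(swap, lam1, lam2)
        magnitude = np.hypot(lam1, lam2)                      # signal strength
        orientation = 0.5 * np.arctan2(2.0 * fxy, fxx - fyy)  # derotation angle
        shape = np.divide(lam2, lam1, out=np.zeros_like(lam1),
                          where=np.abs(lam1) > 1e-12)         # one possible shape coordinate
        return magnitude, orientation, shape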
There is no indication that it will ever be possible to find some simple trick that miraculously solves most problems in vision. It turns out that the processing system must be able to implement a model structure, the complexity of which is directly related to the structural complexity of the problem under consideration in the external world. It has become increasingly apparent that Vision cannot be treated in isolation from the response generation, because a very high degree of integration is required between different levels of percepts and corresponding response primitives. The response to be produced at a given instance is as much dependent upon the state of the system as upon the percepts impinging upon the system. In addition, it has become apparent that many classical aspects of perception, such as geometry, probably do not belong to the percept domain of a Vision system, but to the response domain. This article will focus on what are considered crucial problems in Vision for robotics for the future, rather than on the classical solutions today. It will discuss hierarchical architectures for combination of percept and response primitives. It will discuss the concept of combined percept-response invariances as important structural elements for Vision. It will be maintained that learning is essential to obtain the necessary flexibility and adaptivity. In consequence, it will be argued that invariances for the purpose of Vision are not abstractly geometrical, but derived from the percept-response interaction with the environment. The issue of information representation becomes extremely important in distributed structures of the types foreseen, where uncertainty of information has to be stated for update of models and associated data. The question of object representation is central to the paper. Equivalence is established between the representations of response, geometry and time. Finally an integrated percept-response structure is proposed for flexible response control.
@article{diva2:241534,
author = {Granlund, Gösta},
title = {{The Complexity of Vision}},
journal = {Signal Processing},
year = {1999},
volume = {74},
number = {1},
pages = {101--126},
}
As welfare diseases become more common all over the world the demand for angiography examinations is increasing rapidly. The development of advanced medical signal processing methods has with few exceptions been concentrated towards CT and MR while traditional contrast-based radiology depends on methods developed for ancient photography techniques, despite the fact that angiography sequences are generally recorded in digital form. This article presents a new approach for processing of angiography sequences based on advanced image processing methods. The developed algorithm automatically processes angiography sequences containing motion artifacts that cannot be processed by conventional methods like digital subtraction angiography (DSA) and pixel shift due to non-uniform motions. The algorithm can in simple terms be described as an ideal pixel-shift filter carrying out shifts of different directions and magnitude according to the local motions in the image. In contrast to conventional methods it is fully automatic, no mask image needs to be defined and the manual pixel-shift operations, which are extremely time consuming, are eliminated. The algorithm is efficient and robust and is designed to run on standard hardware of a powerful workstation, which excludes the need for expensive dedicated angiography platforms. Since there is no need to make additional recordings if the patient moves, the patient is exposed to a smaller amount of radiation and contrast fluid. The most exciting benefits of this method are, however, that it opens up new areas for contrast-based angiography that are not possible to process with conventional methods, e.g. non-uniform motions and multiple layers of moving tissue. Advanced image processing methods provide significantly better image quality and noise suppression but also provide the means to compute flow velocity and visualize the flow dynamics in the arterial trees by e.g. using color. Initial tests have proven that it is possible to discriminate capillary blood flow from angiography data, which opens up interesting possibilities for estimating the blood flow in the heart muscle without use of nuclear methods.
@article{diva2:241545,
author = {Knutsson, Hans and Andersson, Mats T. and Kronander, Torbjörn and Hemmendorff, Magnus},
title = {{Spatio-temporal filtering of digital angiography image data}},
journal = {Computer Methods and Programs in Biomedicine},
year = {1998},
volume = {57},
number = {1-2},
pages = {115--123},
}
This paper deals with a new framework for analyzing the formal mathematical correspondence between quantum mechanics and time-frequency representations of a signal. It is also shown that joint time-frequency distributions have a close link with Heisenberg uncertainty relations if the observables are taken as fuzzy entities. This result contradicts the arguments of Cohen [IEEE Proc. 77(7):941 (1989)] regarding the time-frequency distributions and the uncertainty relation. It is postulated that these mechanisms will be of crucial importance in highly fragmented computation structures, such as neural networks, as they may exhibit a strong mutual interaction between data and operator.
@article{diva2:241551,
author = {Roy, Sisir and Kundu, Malay K. and Granlund, Gösta H.},
title = {{Uncertainty Relations and Time-Frequency Distributions for Unsharp Observables}},
journal = {Information Sciences},
year = {1996},
volume = {89},
number = {3-4},
pages = {193--209},
}
@article{diva2:241531,
author = {Westelius, Carl-Johan and Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Focus of Attention Mechanisms using Normalized Convolution}},
journal = {IEEE transactions on robotics and automation},
year = {1996},
}
In this paper, we discuss certain issues regarding robot vision. The main theme will be the importance of the choice of information representation. We will see the implications at different parts of a robot vision structure. We deal with aspects of pre-attentive versus attentive vision, control mechanisms for low level focus of attention, and representation of motion as the orientation of hyperplanes in multidimensional time-space. Issues of scale will be touched upon, and finally, a depth-from-stereo algorithm based on quadrature filter phase is presented.
@article{diva2:241554,
author = {Granlund, Gösta H. and Knutsson, Hans and Westelius, Carl-Johan and Wiklund, Johan},
title = {{Issues in Robot Vision}},
journal = {Image and Vision Computing},
year = {1994},
volume = {12},
number = {3},
pages = {131--148},
}
A study of recurrent associative memories with exclusively short-range connections is presented. To increase the capacity, higher order couplings are used. We study capacity and pattern completion ability of networks consisting of units with binary (±1) output. Results show that perfect learning of random patterns is difficult for very short coupling ranges, and that the average expected capacities (allowing small errors) in these cases are much smaller than the theoretical maximum, 2 bits per coupling. However, it is also shown that by choosing ranges longer than certain limit sizes, depending on network size and order, we can come close to the theoretical capacity limit. We indicate that these limit sizes increase very slowly with net size. Thus, couplings to at least 28 and 36 neighbors suffice for second order networks with 400 and 90,000 units, respectively. From simulations it is found that even networks with coupling ranges below the limit size are able to complete input patterns with more than 10% errors. Especially remarkable is the ability to correct inputs with large local errors (part of the pattern is masked). We present a local learning algorithm for heteroassociation in recurrent networks without hidden units. The algorithm is used in a multinet system to improve pattern completion ability on correlated patterns.
@article{diva2:241555,
author = {Karlholm, Jörgen},
title = {{Associative Memories with Short--Range Higher Order Couplings}},
journal = {Neural Networks},
year = {1993},
volume = {6},
number = {3},
pages = {409--421},
}
A framework for computer-aided analysis of mammograms is described. General computer vision algorithms are combined with application specific procedures in a hierarchical fashion. The system is under development and is currently limited to detection of a few types of suspicious areas. The image features are extracted by using feature extraction methods where wavelet techniques are utilized. A low-pass pyramid representation of the image is convolved with a number of quadrature filters. The filter outputs are combined according to simple local Fourier domain models into parameters describing the local neighborhood with respect to the model. This produces estimates for each pixel describing local size, orientation, Fourier phase, and shape with confidence measures associated to each parameter. Tentative object descriptions are then extracted from the pixel-based features by application-specific procedures with knowledge of relevant structures in mammograms. The orientation, relative brightness and shape of the object are obtained by selection of the pixel feature estimates which best describe the object. The list of object descriptions is examined by procedures, where each procedure corresponds to a specific type of suspicious area, e.g. clusters of microcalcifications.
@article{diva2:241536,
author = {Bårman, Håkan and Granlund, Gösta H. and Haglund, Leif},
title = {{Feature Extraction for Computer-Aided Analysis of Mammograms}},
journal = {International journal of pattern recognition and artificial intelligence},
year = {1993},
volume = {7},
number = {6},
pages = {1339--1356},
}
The problem of detection of orientation in finite dimensional Euclidean spaces is solved in the least squares sense. In particular, the theory is developed for the case when such orientation computations are necessary at all local neighborhoods of the n-dimensional Euclidean space. Detection of orientation is shown to correspond to fitting an axis or a plane to the Fourier transform of an n-dimensional structure. The solution of this problem is related to the solution of a well-known matrix eigenvalue problem. Moreover, it is shown that the necessary computations can be performed in the spatial domain without actually doing a Fourier transformation. Along with the orientation estimate, a certainty measure, based on the error of the fit, is proposed. Two applications in image analysis are considered: texture segmentation and optical flow. An implementation for 2-D (texture features) as well as 3-D (optical flow) is presented. In the case of 2-D, the method exploits the properties of the complex number field to by-pass the eigenvalue analysis, improving the speed and the numerical stability of the method. The theory is verified by experiments which confirm accurate orientation estimates and reliable certainty measures in the presence of noise. The comparative results indicate that the proposed theory produces algorithms computing robust texture features as well as optical flow. The computations are highly parallelizable and can be used in realtime image analysis since they utilize only elementary functions in a closed form (up to dimension 4) and Cartesian separable convolutions.
@article{diva2:241552,
author = {Bigun, Josef and Granlund, Gösta H. and Wiklund, Johan},
title = {{Multidimensional orientation estimation with applications to texture analysis and optical flow}},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {1991},
volume = {13},
number = {8},
pages = {775--790},
}
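In the 2D case, the eigenvalue-free route via the complex number field mentioned in the abstract can be sketched as follows (Python/SciPy; parameter names and smoothing choices are our own assumptions):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def local_orientation_2d(img, grad_sigma=1.0, tensor_sigma=3.0):
        gx = gaussian_filter(img, grad_sigma, order=(0, 1))   # d/dx
        gy = gaussian_filter(img, grad_sigma, order=(1, 0))   # d/dy
        z = (gx + 1j * gy) ** 2          # squaring maps gradient angle phi to 2*phi
        z = gaussian_filter(z.real, tensor_sigma) + 1j * gaussian_filter(z.imag, tensor_sigma)
        orientation = 0.5 * np.angle(z)  # dominant local orientation
        certainty = np.abs(z)            # small where orientations cancel (isotropic areas)
        return orientation, certainty

Averaging in the double-angle representation is what avoids the explicit eigenvalue analysis; the magnitude of the averaged complex field directly serves as the certainty measure.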
A new low-level vision primitive based on logarithmic spirals is presented for various image processing tasks. The detection of such primitives is equivalent to detection of lines and edges in another coordinate system which has been used to model the mapping of the visual field to the striate cortex. Algorithms detecting the proposed primitives and pointing out a matched subclass are presented along with necessary theory. As a result, if the local structure is describable by the proposed primitives then a certainty parameter based on a well-defined mismatch (error) function will indicate this. Moreover, the best fit of a subclass of the proposed primitives in the least squares sense will be computed. The resulting images are unthresholded. They are computed by means of simple convolutions and pixelwise arithmetic operations which make the algorithms suitable for real time image processing applications. Since the resulting images contain information about the local structure, they can be used as feature images in applications like remote sensing, texture analysis, and object recognition. Experimental results on the latter including synthetic as well as natural images are presented along with noise sensitivity tests. The results exhibit good detection properties for the subclasses of the modeled primitives along with uniform and reliable behavior of the corresponding certainty measures.
@article{diva2:241546,
author = {Bigün, Josef},
title = {{A Structure Feature for Some Image Processing Applications Based on Spiral Functions}},
journal = {Computer Vision, Graphics and Image Processing},
year = {1990},
volume = {51},
number = {2},
pages = {166--194},
}
Recent physiological research has indicated that the visual system makes use of units responsive to Gabor signals in the analysis of visual stimuli. Such functions effect a tradeoff between pure spatial- and frequency-domain descriptions. The authors explain the use of such representations in vision, considered as a process of inference from the retinal signals to a symbolic description. The appropriate mathematical structure for the inference is that of the subspaces of the signal vector space, a feature which it shares with quantum mechanics. The theory is derived directly from the fundamental constraints on visual inference. It is then shown to be consistent with many of the known properties of the visual system. In particular, a major feature of the inference system, the occurrence of interference effects, has already been observed in visual system operation.
@article{diva2:241548,
author = {Wilson, Roland and Knutsson, Hans},
title = {{Uncertainty and Inference in the Visual System}},
journal = {IEEE Transactions on Systems, Man and Cybernetics},
year = {1988},
volume = {18},
number = {2},
pages = {305--312},
}
Phototransduction in rod cells is likely to involve an intracellular messenger system that links the absorption of light by rhodopsin to a change in membrane conductance. The direct effect of guanosine 3',5'-monophosphate (cGMP) on excised patches of rod outer segment membrane strongly supports a role for cGMP as an intracellular messenger in phototransduction. It is reported here that magnesium and calcium directly affect the conductance of excised patches of rod membrane in the absence of cGMP and that magnesium, applied to intact rod cells, blocks a component of the cellular light response. The divalent cation-suppressed conductance in excised patches showed outward rectification and cation-selective permeability resembling those of the light-suppressed conductance measured from the intact rod cell. The divalent cation-suppressed conductance was partly blocked by a concentration of the pharmacological agent L-cis-diltiazem that blocked all of the cGMP-activated conductance. Divalent cations may act, together with cGMP, as an intracellular messenger system that mediates the light response of the rod photoreceptor cell.
@article{diva2:241547,
author = {Stern, J. H. and Knutsson, Hans and MacLeish, P. R.},
title = {{Divalent Cations Directly Affect the Conductance of Excised Patches of Rod Photoreceptor Membrane.}},
journal = {Science},
year = {1987},
volume = {236},
number = {4809},
pages = {1674--1678},
}
@article{diva2:241537,
author = {MacLeish, P. R. and Knutsson, Hans and Stern, J. H.},
title = {{The Control of the Rod Outer Segment Conductance by Cyclic-GMP and Divalent Cations.}},
journal = {Photobiochemistry and Photobiophysics},
year = {1986},
volume = {13},
pages = {359--372},
}
A new device for generation of electromagnetic fields at extra low frequencies, to be used in fracture treatment, is described. The device involves a coil and a battery-powered noise generator. An alternating magnetic field of 4 × 10^-4 T (4 gauss) (RMS value) with a frequency range 1–1000 Hz is generated. Results from a controlled randomized study of fresh fractures have shown significant differences (p < 0.01) between the treated group and the control group. The results are encouraging and motivate further investigations with this method.
@article{diva2:241539,
author = {Wahlström, Ola and Knutsson, Hans},
title = {{A Device for Generation of Electromagnetic Fields of Extremely Low Frequency}},
journal = {Journal of Biomedical Engineering},
year = {1984},
volume = {6},
number = {4},
pages = {293--296},
}
A new predictive coder, based on an estimation method which adapts to line and edge features in images, is described. Quantization of the prediction error is performed by a two-level adaptive scheme: an adaptive transform coder, and a threshold coding in both transform and spatial domains. Control information, which determines the behavior of the predictor, is quantized using a simple variable rate technique. The results are improved by pre- and post-filtering using a related noncausal form of the estimator. Acceptable images have been produced in this way at bit rates of less than 0.5 bit/pixel.
@article{diva2:241549,
author = {Wilson, Roland and Knutsson, Hans and Granlund, Gösta H.},
title = {{Anisotropic Non-Stationary Image Estimation and its Applications:
Part II. Predictive Image Coding}},
journal = {IEEE Transactions on Communications},
year = {1983},
volume = {31},
number = {3},
pages = {398--406},
}
A new form of image estimator, which takes account of linear features, is derived using a signal equivalent formulation. The estimator is shown to be a nonstationary linear combination of three stationary estimators. The relation of the estimator to human visual physiology is discussed. A method for estimating the nonstationary control information is described and shown to be effective when the estimation is made from noisy data. A suboptimal approach which is computationally less demanding is presented and used in the restoration of a variety of images corrupted by additive white noise. The results show that the method can improve the quality of noisy images even when the signal-to-noise ratio is very low.
@article{diva2:241544,
author = {Knutsson, Hans and Wilson, Roland and Granlund, Gösta H.},
title = {{Anisotropic Non-Stationary Image Estimation and its Applications:
Part I. Restoration of Noisy Images}},
journal = {IEEE Transactions on Communications},
year = {1983},
volume = {COM--31},
number = {3},
pages = {388--397},
}
In a special radiographic process, ectomography, an image of a slice is produced by simple summation of a set of specially filtered component images, each of which represents one of at least 60 different projections of the object. After being digitized, they are stored, filtered, and summed in a computer. Images representing any slice of any thickness in the object may be produced from the same set of component images. All details within the slice are pictured correctly while details outside are almost completely eliminated.
@article{diva2:241543,
author = {Petersson, Christer U. and Edholm, Paul and Granlund, Gösta H. and Knutsson, Hans E.},
title = {{Ectomography. A New Radiographic Reconstruction Method:
II. Computer Simulated Experiments}},
journal = {IEEE Transactions on Biomedical Engineering},
year = {1980},
volume = {BME--27},
number = {11},
pages = {649--655},
}
The mathematical basis of a new radiographic method is described, by which an arbitrarily thick layer of the patient may be reconstructed. The reconstruction is performed from at least 60 images of the volume under examination. Each of these images, which have to be in digital form, is subjected to a special filtration process of its spatial frequencies. The combination of all the images will form the resulting image of the layer, the ectomogram. The method has been analysed and tested in experiments simulated with a computer.
@article{diva2:241542,
author = {Edholm, Paul and Granlund, Gösta and Knutsson, Hans and Petersson, C.},
title = {{Ectomography:
A New Radiographic Method for Reproducing a Selected Slice of Varying Thickness}},
journal = {Acta Radiologica},
year = {1980},
volume = {21},
number = {4},
pages = {433--442},
}
Radiographic technology has advanced considerably during recent years with the advent of reconstruction techniques allowing visualization of slices through the body. In spite of the advantages of computed tomography compared to conventional radiographic methods, there are still some shortcomings with the method: if a different section of the body is desired, another recording has to be made, the width of the slice reconstructed is fixed, and a full 180° view angle is required.
@article{diva2:241533,
author = {Knutsson, Hans E. and Edholm, Paul and Granlund, Gösta H. and Petersson, Christer U.},
title = {{Ectomography. A New Radiographic Reconstruction Method:
I. Theory and Error Estimates}},
journal = {IEEE Transactions on Biomedical Engineering},
year = {1980},
volume = {BME--27},
number = {11},
pages = {640--645},
}
The problem of finding a general, parallel, and hierarchical operator for picture processing is considered. An operator is defined which at different levels can detect and describe structure as opposed to uniformity within local regions, whatever structure and uniformity may imply at a particular level. The operator performs a mapping from one complex field to another. The important characteristic of this approach is the use of complex fields which allows a global-to-local feedback. In the transformation process the image is simplified. A Fourier implementation of the operator is described and a new transform is defined. The operators become increasingly global on higher levels in order to include adjacent high-level features. A hierarchical structure of such transformations gives a sequential description of structure over increasingly larger regions of the image. The processed information at different levels can be used as input to a classifier. Examples are given of processing results.
@article{diva2:241532,
author = {Granlund, Gösta H.},
title = {{In Search of a General Picture Processing Operator}},
journal = {Computer Graphics and Image Processing},
year = {1978},
volume = {8},
number = {2},
pages = {155--173},
}
The advent of new stains for chromosomes has increased the possibility of implementing useful automated chromosome analysis. The ease with which chromosomes can now be recognized makes it possible to perform detailed statistical analysis of the chromosomes of an individual. This paper describes methods for assembling chromosome information from several cells in such a way that accidental variations due to preparation, etc. can be eliminated and an undistorted set of characteristics of the chromosome complement can be established. This set of characteristics can then be compared with various references, and statements can be made concerning the relationships between variations in the chromosome complement and genetic traits. These same methods can be employed in multiple-cell karyotyping to circumvent the classical problem of touching and overlapping chromosomes. The methods also allow one to achieve very reliable descriptions of the chromosome complement. The importance of appropriate descriptors of the chromosomes is illustrated.
@article{diva2:241535,
author = {Granlund, Gösta H.},
title = {{Statistical Analysis of Chromosome Characteristics}},
journal = {Pattern Recognition},
year = {1974},
volume = {6},
number = {2},
pages = {115--126},
}
The advent of new stains for chromosomes has increased the possibilities that useful automated chromosome analysis can be implemented. The search for appropriate descriptors to use in this process is an important task. Data compression using integrated intensity and density profiles has already shown itself to be valuable. A method is proposed in this paper to describe these profiles as a sum of distribution functions. Every distribution function can be described by a triplet stating peak height, position, and width and it appears that these parameters are directly related to physical processes. The importance of such parameters in statistical chromosome analysis is emphasized. A classification experiment is described in which 240 chromosomes of types 1 to 22, X and Y were classified with an accuracy of 96%.
@article{diva2:241556,
author = {Granlund, Gösta H.},
title = {{The Use of Distribution Functions to Describe Integrated Density Profiles of Human Chromosomes}},
journal = {Journal of Theoretical Biology},
year = {1973},
volume = {40},
number = {3},
pages = {573--589},
}
@article{diva2:241553,
author = {Granlund, Gösta H.},
title = {{Fourier Preprocessing for Hand Print Character Recognition}},
journal = {IEEE Transactions on Computers},
year = {1972},
volume = {C--21},
number = {2},
pages = {195--201},
}
Books
Under the title "Probabilistic and Biologically Inspired Feature Representations," this text collects a substantial amount of work on the topic of channel representations. Channel representations are a biologically motivated, wavelet-like approach to visual feature descriptors: they are local and compact, they form a computational framework, and the represented information can be reconstructed. The first property is shared with many histogram- and signature-based descriptors, the latter property with the related concept of population codes. In their unique combination of properties, channel representations become a visual Swiss army knife—they can be used for image enhancement, visual object tracking, as 2D and 3D descriptors, and for pose estimation. In the chapters of this text, the framework of channel representations will be introduced, its attributes will be elaborated, and further insight into its probabilistic modeling and algorithmic implementation will be given. Channel representations are a useful toolbox to represent visual information for machine learning, as they establish a generic way to compute popular descriptors such as HOG, SIFT, and SHOT. Even in an age of deep learning, they provide a good compromise between hand-designed descriptors and a-priori structureless feature spaces as seen in the layers of deep networks.
@book{diva2:1211520,
author = {Felsberg, Michael},
title = {{Probabilistic and biologically inspired feature representations}},
publisher = {Morgan \& Claypool Publishers},
year = {2018},
address = {San Rafael},
}
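As a small illustration of the idea, the sketch below encodes a scalar value into overlapping cos^2 channels, one common kernel choice in the channel representation literature (the function name, value-range mapping and channel count are our own assumptions):

    import numpy as np

    def encode_cos2_channels(x, n_channels, lo, hi):
        # map the value range [lo, hi] onto channel centres 0 .. n_channels-1
        x = np.asarray(x, dtype=float)
        t = (x - lo) / (hi - lo) * (n_channels - 1)
        centres = np.arange(n_channels)
        d = t[..., None] - centres                 # signed distance to each centre
        k = np.cos(np.pi * d / 3.0) ** 2           # cos^2 kernel with support |d| < 1.5
        k[np.abs(d) >= 1.5] = 0.0
        return k

At most three channels are active for any value and, away from the borders, their coefficients sum to a constant (3/2), which is what makes the encoding both local and reconstructable.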
We, the Guest Editors of this special issue of Pattern Recognition Letters, are pleased to share these contributions with you. The papers included here are based on work from the 22nd International Conference on Pattern Recognition (ICPR) in Stockholm, Sweden, held August 24–28, 2014. The papers selected for this special issue were those winning one of the IAPR awards, as well as one paper by a former student of the winner of the KS Fu Prize, Prof. Jitendra Malik. Taken together, this body of work represents some of the finest research being conducted by the IAPR community worldwide, it builds on a rich legacy of accomplishment by the entire community, and it offers a view to the future, to where we are going as a scientific community.
For each of the award-winning papers, the authors were asked to revise and extend their contributions to full journal length and to provide true added value vis-à-vis the original conference submission. In some cases, the authors elected to modify the titles slightly, and in some cases the list of authors has also been modified. The resulting manuscripts were sent out for full review by a different set of referees than those who reviewed the conference versions. The process, including required revisions, was in accordance with the standing editorial policy of Pattern Recognition Letters, resulting in the final versions accepted and appearing here. These are thoroughly vetted, high-caliber scientific contributions.
It has been our honor to serve as Guest Editors for this special issue. We would like to thank the Editors of Pattern Recognition Letters for allowing us this opportunity. We are especially grateful to Dr. Gabriella Sanniti di Baja for her enthusiasm, support, and her willingness to keep prodding us along to bring the special issue through to completion. We would also like to thank all of those who reviewed the papers, both originally for the conference and subsequently for the journal, and those who served on the ICPR awards and KS Fu Prize committees.
Finally, we express our heartfelt gratitude to all of the authors for taking the time to prepare these versions for our collective enlightenment, sharing their knowledge, innovation, and discoveries with the rest of us.
@book{diva2:916338,
editor = {Chellappa, Rama and Heyden, Anders and Laurendeau, Denis and Felsberg, Michael and Borga, Magnus},
title = {{Special issue on ICPR 2014 awarded papers}},
publisher = {Elsevier},
year = {2016},
}
Signal Processing for Computer Vision is a unique and thorough treatment of the signal processing aspects of filters and operators for low-level computer vision.
Computer vision has progressed considerably over recent years. From methods only applicable to simple images, it has developed to deal with increasingly complex scenes, volumes and time sequences. A substantial part of this book deals with the problem of designing models that can be used for several purposes within computer vision. These partial models have some general properties of invariance generation and generality in model generation.
Signal Processing for Computer Vision is the first book to give a unified treatment of representation and filtering of higher order data, such as vectors and tensors in multidimensional space. Included is a systematic organisation for the implementation of complex models in a hierarchical modular structure and novel material on adaptive filtering using tensor data representation.
Signal Processing for Computer Vision is intended for final year undergraduate and graduate students as well as engineers and researchers in the field of computer vision and image processing.
@book{diva2:302469,
editor = {Granlund, Gösta and Knutsson, Hans},
title = {{Signal Processing for Computer Vision}},
publisher = {Kluwer},
year = {1995},
address = {Dordrecht},
}
Book chapters
The choice of the material base into which the material decomposition is performed in dual-energy computed tomography may affect the quality of reconstructed images. Resulting inaccuracies may lower their diagnostic value, or, if the data are used for radiation treatment planning, the accuracy of such plans. The aim of this work is to investigate how the commonly used (water, bone) (WB), (water, iodine) (WI), and (approximate photoelectric effect, Compton scattering) (PC) doublets affect the reconstructed linear attenuation coefficient in the case of the Alvarez–Macovski (AM) method. The performance of this method is also compared to the performance of the dual-energy iterative reconstruction algorithm DIRA. In both cases, the study is performed using simulations.
The results show that the PC and WB doublets accurately predicted the linear attenuation coefficient (LAC) values for human tissues and elements with Z = 1, …, 20, in the 20–150 keV range, though there was a small (<5%) discrepancy in the 20–35 keV range. The WI doublet did not represent the tissues as well as PC and WB; the largest discrepancies (>50% in some cases) were in the 20–40 keV range.
LACs reconstructed with the AM method and DIRA followed this trend. The AM method produced artifacts when iodine was present in the phantom together with human tissues, since it can only work with one doublet at a time. It was shown that these artifacts could be avoided with DIRA using different doublets at different spatial positions, i.e., WB for soft and bone tissue and WI for the iodine solution.
@incollection{diva2:1761777,
author = {Magnusson, Maria and Alm Carlsson, Gudrun and Sandborg, Michael and Carlsson Tedgren, Åsa and Malusek, Alexandr},
title = {{On the Choice of Base Materials for Alvarez--Macovski and DIRA Dual-energy Reconstruction Algorithms in CT}},
booktitle = {Photon Counting Computed Tomography},
year = {2023},
pages = {153--175},
publisher = {Springer},
address = {Cham},
}
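The base-material question above concerns the two-function expansion underlying the Alvarez–Macovski method, which (in a standard formulation, not quoted from the chapter) writes the linear attenuation coefficient as

    \[
    \mu(\mathbf{x}, E) \;\approx\; a_1(\mathbf{x})\, f_1(E) \;+\; a_2(\mathbf{x})\, f_2(E),
    \]

where the doublet (f_1, f_2) is either a pair of interaction cross sections, e.g. an approximate photoelectric dependence (roughly proportional to E^-3) together with the Klein–Nishina function for Compton scattering (the PC doublet), or the tabulated attenuation curves of two materials such as water and bone (WB) or water and iodine (WI); the spatially varying coefficients a_1, a_2 are estimated from the two energy measurements.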
@incollection{diva2:1696385,
author = {Felsberg, Michael},
title = {{Visual tracking:
Tracking in scenes containing multiple moving objects}},
booktitle = {Advanced Methods and Deep Learning in Computer Vision},
year = {2022},
pages = {305--336},
publisher = {Elsevier},
address = {London},
}
Scene reconstruction, i.e. the process of creating a 3D representation (mesh) of some real world scene, has recently become easier with the advent of cheap RGB-D sensors (e.g. the Microsoft Kinect).
Many such sensors use rolling shutter cameras, which produce geometrically distorted images when they are moving. To mitigate these rolling shutter distortions, we propose a method that uses an attached gyroscope to rectify the depth scans. We also present a simple scheme to calibrate the relative pose and time synchronization between the gyro and a rolling shutter RGB-D sensor.
For scene reconstruction we use the Kinect Fusion algorithm to produce meshes. We create meshes from both raw and rectified depth scans, and these are then compared to a ground truth mesh. The types of motion we investigate are: pan, tilt and wobble (shaking) motions.
As our method relies on gyroscope readings, the amount of computations required is negligible compared to the cost of running Kinect Fusion.
This chapter is an extension of a paper at the IEEE Workshop on Robot Vision [10]. Compared to that paper, we have improved the rectification to also correct for lens distortion, and use a coarse-to-fine search to find the time shift more quickly. We have extended our experiments to also investigate the effects of lens distortion, and to use more accurate ground truth. The experiments demonstrate that correction of rolling shutter effects yields a larger improvement of the 3D model than correction for lens distortion.
@incollection{diva2:789457,
author = {Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik and Törnqvist, David},
title = {{Improving RGB-D Scene Reconstruction using Rolling Shutter Rectification}},
booktitle = {New Development in Robot Vision},
year = {2015},
pages = {55--71},
publisher = {Springer Berlin/Heidelberg},
}
Online learning of vision-based robot control requires appropriate activation strategies during operation. In this chapter we present such a learning approach with applications to two areas of vision-based robot control. In the first setting, self-evaluation is possible for the learning system and the system autonomously switches to learning mode for producing the necessary training data by exploration. The other application is in a setting where external information is required for determining the correctness of an action. Therefore, an operator provides training data when required, leading to an automatic mode switch to online learning from demonstration. In experiments for the first setting, the system is able to autonomously learn the inverse kinematics of a robotic arm. We propose improvements producing more informative training data compared to random exploration. This reduces training time and limits learning to regions where the learnt mapping is used. The learnt region is extended autonomously on demand. In experiments for the second setting, we present an autonomous driving system learning a mapping from visual input to control signals, which is trained by manually steering the robot. After the initial training period, the system seamlessly continues autonomously. Manual control can be taken back at any time for providing additional training.
@incollection{diva2:750041,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Online Learning of Vision-Based Robot Control during Autonomous Operation}},
booktitle = {New Development in Robot Vision},
year = {2015},
pages = {137--156},
publisher = {Springer Berlin/Heidelberg},
}
This volume does much more than survey modern advanced color processing. Starting with a historical perspective on ways we have classified color, it sets out the latest numerical techniques for analyzing and processing colors, the leading edge in our search to accurately record and print what we see. The human eye perceives only a fraction of available light wavelengths, yet we live in a multicolor world of myriad shining hues. Colors rich in metaphorical associations make us "purple with rage" or "green with envy" and cause us to "see red." Defining colors has been the work of centuries, culminating in today's complex mathematical coding that nonetheless remains a work in progress: only recently have we possessed the computing capacity to process the algebraic matrices that reproduce color more accurately. With chapters on dihedral color and image spectrometers, this book provides technicians and researchers with the knowledge they need to grasp the intricacies of today's color imaging.
@incollection{diva2:609832,
author = {Lenz, Reiner and Zografos, Vasileios and Solli, Martin},
title = {{Dihedral Color Filtering}},
booktitle = {Advanced Color Image Processing and Analysis},
year = {2013},
pages = {119--145},
publisher = {Springer},
}
This book presents a mathematical methodology for image analysis tasks at the edge of current research, including anisotropic diffusion filtering of tensor fields. Instead of specific applications, it explores methodological structures on which they are built.
@incollection{diva2:491401,
author = {Felsberg, Michael},
title = {{Adaptive Filtering using Channel Representations}},
booktitle = {Mathematical Methods for Signal and Image Analysis and Representation},
year = {2012},
pages = {31--48},
publisher = {Springer London},
}
This chapter is devoted to the cooperation of multiple UAVs for environment perception. First, probabilistic methods for multi-UAV cooperative perception are analyzed. Then, the problem of multi-UAV detection, localization and tracking is described, and local image processing techniques are presented. Finally, the chapter shows two approaches based on the Information Filter and on evidence grid representations.
@incollection{diva2:275335,
author = {Merino, Luis and Caballero, Fernando and Ferruz, Joaquín and Wiklund, Johan and Forssen, Per-Erik and Ollero, Anibal},
title = {{Multi-UAV Cooperative Perception Techniques}},
booktitle = {Multiple Heterogeneous Unmanned Aerial Vehicles},
year = {2007},
pages = {67--110},
publisher = {Springer},
address = {Berlin / Heidelberg},
}
This chapter presents a vision-based method for unmanned aerial vehicle (UAV) motion estimation that uses as input an image motion field obtained from matches of point-like features. The chapter enhances vision-based techniques developed for single UAV localization and demonstrates how they can be modified to deal with the problem of multi-UAV relative position estimation. The proposed approach is built upon the assumption that if different UAVs identify, using their cameras, common objects in a scene, the relative pose displacement between the UAVs can be computed from these correspondences under certain assumptions. However, although point-like features are suitable for local UAV motion estimation, finding matches between images collected using different cameras is a difficult task that may be overcome using blob features. Results justify the proposed approach.
@incollection{diva2:273757,
author = {Merino, Luis and Caballero, Fernando and Forss\'{e}n, Per-Erik and Wiklund, Johan and Ferruz, Joaquín and Martinez-de Dios, Jose Ramiro and Moe, Anders and Nordberg, Klas and Ollero, Anibal},
title = {{Single and Multi-UAV Relative Position Estimation Based on Natural Landmarks}},
booktitle = {Advances in Unmanned Aerial Vehicles},
year = {2007},
pages = {267--307},
publisher = {Springer},
address = {Netherlands},
}
This volume is a post-event proceedings volume and contains selected papers based on the presentations given, and the lively discussions that ensued, during a seminar held in Dagstuhl Castle, Germany, in October 2003. Co-sponsored by ECVision, the cognitive vision network of excellence, it was organized to further strengthen cooperation between research groups from different countries working in the field of cognitive vision systems.
@incollection{diva2:241587,
author = {Granlund, Gösta},
title = {{Organization of Architectures for Cognitive Vision Systems}},
booktitle = {Cognitive Vision Systems},
year = {2006},
pages = {37--55},
publisher = {Springer},
}
This paper presents novel results from an ongoing feasibility study of fully 3D X-ray scanning of Pinus Sylvestris (Scots Pine) logs. Logs are assumed to be translated through two identical and static cone beam systems with the beams rotated 90 degrees relative to each other, providing a dual set of 2D-projections. For reasons of both cost and speed, each 2D-detector in these two systems consists of a limited number of line detectors. The quality of the reconstructed images is far from perfect, due to sparse detector data and missing projection angles. In spite of this we show that by employing a shape- and direction discriminative technique based on second derivatives, we are able to enhance knot-like features in these data. In the enhanced images it is then possible to detect and localize the pith for each whorl of knots, and subsequently also to perform a full segmentation of the knots in the heartwood.
@incollection{diva2:269474,
author = {Flood, Katarina and Danielsson, Per-Erik and Magnusson Seger, Maria},
title = {{On 3D scanning, reconstruction, enhancement, and segmentation of logs}},
booktitle = {Image Analysis},
year = {2003},
pages = {733--740},
publisher = {Springer Berlin/Heidelberg},
}
This paper presents a novel two-frame motion estimation algorithm. The first step is to approximate each neighborhood of both frames by quadratic polynomials, which can be done efficiently using the polynomial expansion transform. From observing how an exact polynomial transforms under translation a method to estimate displacement fields from the polynomial expansion coefficients is derived and after a series of refinements leads to a robust algorithm. Evaluation on the Yosemite sequence shows good results.
@incollection{diva2:269471,
author = {Farnebäck, Gunnar},
title = {{Two-frame motion estimation based on polynomial expansion}},
booktitle = {Image Analysis},
year = {2003},
pages = {363--370},
publisher = {Springer Berlin/Heidelberg},
}
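OpenCV ships a dense optical flow implementation based on this algorithm; a minimal usage sketch (file names and parameter values below are placeholders, not taken from the paper) looks as follows:

    import cv2

    # two consecutive grayscale frames (paths are placeholders)
    prev_frame = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
    next_frame = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

    # arguments: previous image, next image, initial flow (None),
    # pyramid scale, pyramid levels, window size, iterations,
    # polynomial neighbourhood size, polynomial Gaussian sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # per-pixel displacement magnitude and direction
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

The two polynomial arguments control the local quadratic fit, while the pyramid parameters and iteration count provide the iterative, coarse-to-fine refinement mentioned in the abstract.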
In most, if not all, fast marching methods published hitherto, the input cost function and the output arrival time are sampled on exactly the same grid. But since the input data samples are differences of the output samples, we found it natural to separate the input and output grid by half a sampling unit in all coordinates (two or three). We also employ the 8-neighborhood (26-neighborhood in the 3D-case) in the basic updating step of the algorithm. Some simple numerical experiments verify that the modified method improves the accuracy considerably. Moreover, we feel that the modified method lends itself more naturally to image processing applications like tracking and segmentation.
@incollection{diva2:269477,
author = {Danielsson, Per-Erik and Lin, Qingfen},
title = {{A modified fast marching method}},
booktitle = {Image Analysis},
year = {2003},
pages = {1154--1161},
publisher = {Springer Berlin/Heidelberg},
}
Perceptual experiments indicate that corners and curvature are very important features in the process of recognition. This paper presents a new method to efficiently detect rotational symmetries, which describe complex curvature such as corners, circles, star- and spiral patterns. The method is designed to give selective and sparse responses. It works in three steps: first, extract local orientation from a gray-scale or color image; second, correlate the orientation image with rotational symmetry filters; and third, let the filter responses inhibit each other in order to get more selective responses. The correlations can be made efficient by separating the 2D-filters into a small number of 1D-filters. These symmetries can serve as feature points at a high abstraction level for use in hierarchical matching structures for 3D-estimation, object recognition, etc.
@incollection{diva2:269988,
author = {Johansson, Björn and Granlund, Gösta},
title = {{Fast selective detection of rotational symmetries using normalized inhibition}},
booktitle = {Proceedings of the 6th European Conference on Computer Vision, Dublin, Ireland, June 26 - July 1, Part I},
year = {2000},
pages = {871--887},
publisher = {Springer},
address = {London},
}
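A rough sketch of the first two steps follows (Python/SciPy; the filter window, symmetry order and all names are our own assumptions, and the normalized inhibition step is omitted): the local orientation is computed in double-angle form and then correlated with a complex rotational symmetry filter:

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from scipy.signal import fftconvolve

    def rotational_symmetry_response(img, n=2, grad_sigma=1.0, radius=8):
        # step 1: double-angle orientation image z = (gx + i*gy)^2
        gx = gaussian_filter(img, grad_sigma, order=(0, 1))
        gy = gaussian_filter(img, grad_sigma, order=(1, 0))
        z = (gx + 1j * gy) ** 2
        # step 2: correlate with a windowed rotational symmetry filter w(r) * exp(i*n*phi)
        y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        r = np.hypot(x, y)
        w = np.exp(-0.5 * (r / (radius / 2.0)) ** 2) * (r > 0)
        b = w * np.exp(1j * n * np.arctan2(y, x))
        # correlation implemented as convolution with the flipped, conjugated filter
        return fftconvolve(z, np.conj(b[::-1, ::-1]), mode="same")

For n = 2 the response magnitude is large near the centres of circle-, star- and spiral-like patterns, with the response phase distinguishing between them; the separable 1D implementation and the inhibition stage described above are what make the full method fast and selective.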
Most of the processing in vision today uses spatially invariant operations. This gives efficient and compact computing structures, with the conventional convenient separation between data and operations. This also goes well with conventional Cartesian representation of data. Currently, there is a trend towards context dependent processing in various forms. This implies that operations will no longer be spatially invariant, but vary over the image dependent upon the image content. There are many ways in which such a contextual control can be implemented. Mechanisms can be added for the modification of operator behavior within the conventional computing structure. This has been done e.g. for the implementation of adaptive filtering. In order to obtain sufficient flexibility and power in the computing structure, it is necessary to go further than that. To achieve sufficiently good adaptivity, it is necessary to ensure that sufficiently complex control strategies can be represented. It is becoming increasingly apparent that this cannot be achieved through prescription or program specification of rules, the reason being that these rules will be dauntingly complex and cannot be dealt with in sufficient detail. The requirement of spatially variant processing in turn implies the requirement of a spatially variant information representation; otherwise a sufficiently effective and flexible contextual control cannot be implemented. This paper outlines a new structure for effective space variant processing. It utilises a new type of localized information representation, which can be viewed as outputs from band pass filters such as wavelets. A unique and important feature is that convex regions can be built up from a single layer of associating nodes. The specification of operations is made through learning or action controlled association.
@incollection{diva2:246014,
author = {Granlund, Gösta H.},
title = {{An Associative Perception-Action Structure using a Localized Space Variant Information Representation}},
booktitle = {Algebraic Frames for the Perception-Action Cycle},
year = {2000},
pages = {48--68},
publisher = {Springer},
}
@incollection{diva2:275341,
author = {Ulvklo, Morgan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Texture Gradient in Sparse Texture Fields}},
booktitle = {Theory and Applications of Image Analysis II},
year = {1996},
publisher = {World Scientific},
}
This chapter deals with texture analysis, an important application of the methods described in earlier chapters. It introduces ideas from preattentive vision, which gives clues for the extraction of texture primitives. There is also a discussion on how to handle features whose significance varies with spatial position.
@incollection{diva2:405418,
author = {Ulvklo, Morgan},
title = {{Texture Analysis}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {399--418},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter is not original, but presents methods for linear classification in the tradition of N. J. Nilsson as well as R. O. Duda and P. E. Hart. Part of the motivation for including this well-known material is to allow the vision structure to be brought to a logical conclusion in which feature properties are combined to form responses or class statements. Another motivation developed here is to display the similarity in structure between convolution operations and linear discriminant functions. This brings all operations for feature extraction and classification to the use of a common component, linear discriminants. This is also illustrated in the form of perceptrons, which allows a transition to the modern theory of neural networks.
@incollection{diva2:405416,
author = {Granlund, Gösta H. and Karlholm, Jörgen},
title = {{Classification and Response Generation}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {367--397},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter discusses techniques for processing of higher order data such as vector and tensor fields. As abstraction implies a more complex descriptor, developing methods for processing of higher order data is an essential part of any hierarchical or layered approach to vision. The chapter focuses on models for extracting local symmetries and discontinuities in higher order fields.
@incollection{diva2:405412,
author = {Westin, Carl-Fredrik},
title = {{Vector and Tensor Field Filtering}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {343--365},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter presents a computationally efficient technique for adaptive filtering of n-dimensional signals.
The approach is based on the local signal description given by the orientation tensor discussed in Chapter 6. The adaptive filter output is synthesized as a tensor-controlled weighted summation of shift-invariant filter outputs. Several examples of adaptive filtering in two and three dimensions are given. The chapter contains original results on the extension of the techniques to n dimensions.
@incollection{diva2:405410,
author = {Knutsson, Hans and Haglund, Leif},
title = {{Adaptive Filtering}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {309--342},
publisher = {Kluwer},
address = {Dordrecht},
}
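A minimal sketch of the general idea described in the adaptive-filtering chapter above: the adaptive output is a per-pixel weighted sum of fixed, shift-invariant filter outputs, where the weights would come from a local (orientation tensor based) signal description. The weighting here is supplied as a plain array and the filters are toy derivative kernels; this is an illustrative simplification under those assumptions, not the book's actual algorithm.

import numpy as np
from scipy.ndimage import convolve

def adaptive_combine(image, kernels, weight_maps):
    """Weighted per-pixel combination of fixed (shift-invariant) filter outputs.

    kernels     : list of K small 2D arrays (the fixed filter bank)
    weight_maps : array (K, H, W) of per-pixel weights, e.g. derived from a
                  local orientation tensor (here simply given as input).
    """
    outputs = np.stack([convolve(image, k, mode="nearest") for k in kernels])
    return np.sum(weight_maps * outputs, axis=0)

# Toy usage with two derivative-like kernels and uniform weights.
img = np.random.default_rng(0).normal(size=(64, 64))
kx = np.array([[-1.0, 0.0, 1.0]])
ky = kx.T
w = np.full((2, 64, 64), 0.5)
result = adaptive_combine(img, [kx, ky], w)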
This chapter considers what the important properties are for an information representation to behave well in various transformations. There is an extended discussion on the necessity to separate between class membership and certainty of a signal.
@incollection{diva2:405408,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Representation and Averaging}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {297--308},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter deals with the estimation of local frequency and bandwidth. Local frequency is an important concept which provides an indication of the appropriate range of scales for subsequent analysis. A number of one-dimensional and two-dimensional examples of local frequency and bandwidth estimation are given.
@incollection{diva2:405405,
author = {Knutsson, Hans and Westin, Carl-Fredrik},
title = {{Local Frequency}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {279--295},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter deals with the concept of phase and phase representation in multiple dimensions. Phase is an important concept, which emerges in several contexts in vision. The chapter provides a detailed treatment of phase properties in various situations and deals with how to maintain continuity in phase representation. An example is given of how local phase at different scales can be used for disparity estimation. The chapter contains original material on the representation of phase for signals in three dimensions and higher dimensions.
@incollection{diva2:405404,
author = {Westelius, Carl-Johan},
title = {{Local Phase Estimation}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {259--278},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter introduces the use of tensors in estimation of local structure and orientation. The tensor representation is shown to be crucial to unambiguous and continuous representation of local orientation in multiple dimensions. In addition to orientation the tensor representation also conveys the degree and type of local anisotropy. The orientation estimation approach is developed in detail for two, three and four dimensions and is shown to be extendable to higher dimensions. Examples and performance measures are given for processing of images, volumes and time sequences.
@incollection{diva2:405403,
author = {Knutsson, Hans and Andersson, Mats and Haglund, Leif and Wiklund, Johan},
title = {{Orientation and Velocity}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {219--258},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter presents a method for obtaining an optimal n-dimensional set of filter coefficients for any given frequency response. An optimality criterion is defined that enables different frequencies to be given individual weights. Appropriate forms of frequency weight functions are discussed and a number of optimization examples are given.
@incollection{diva2:405402,
author = {Knutsson, Hans},
title = {{Kernel Optimization}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {199--218},
publisher = {Kluwer},
address = {Dordrecht},
}
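One common reading of the kernel optimization problem described above is a weighted least-squares fit: find spatial filter coefficients whose frequency response best matches a desired response under a per-frequency weighting. The 1D sketch below follows that reading; the frequency grid, weighting function, desired response, and the assumption of a real, even kernel are all invented for the example and are not the chapter's actual formulation.

import numpy as np

def optimize_kernel(n_taps, freqs, desired, weights):
    """Solve min_c || W (F c - desired) ||^2 for real, even kernel coefficients c.

    freqs   : sampled frequencies in [-pi, pi]
    desired : desired (real) frequency response at those frequencies
    weights : per-frequency importance weights
    """
    taps = np.arange(n_taps) - (n_taps - 1) / 2      # centered tap positions
    F = np.cos(np.outer(freqs, taps))                # real/even basis: F[k, n] = cos(w_k * n)
    w_sqrt = np.sqrt(weights)
    c, *_ = np.linalg.lstsq(w_sqrt[:, None] * F, w_sqrt * desired, rcond=None)
    return c

freqs = np.linspace(-np.pi, np.pi, 257)
desired = (np.abs(freqs) < 1.0).astype(float)        # ideal low-pass, toy target
weights = 1.0 / (np.abs(freqs) + 0.1)                # emphasize low frequencies
kernel = optimize_kernel(9, freqs, desired, weights)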
This chapter is on Fourier methods, with a particular emphasis on definitions and theorems essential to the understanding of filtering procedures in multi-dimensional spaces. This is a central issue in computer vision.
@incollection{diva2:405398,
author = {Nordberg, Klas},
title = {{Fourier Transforms}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {117--197},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter gives an introductory treatment of operations and representations for low-level features in multi-dimensional spaces. An important issue is how to combine contributions from several filters to provide robust statements in accordance with certain low-level models. This chapter gives an introduction to the problems of unambiguous mappings in multi-dimensional spaces.
@incollection{diva2:405376,
author = {Granlund, Gösta H. and Wiklund, Johan},
title = {{Low Level Operations}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {97--116},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter gives an overview of important biological vision mechanisms. Although a great deal is known about neural processing of visual information, most essential questions about biological vision remain as yet unanswered. Nonetheless, the knowledge available has already provided useful guidance to the organization of effective machine vision systems.
@incollection{diva2:405374,
author = {Granlund, Gösta H. and Karlholm, Jörgen and Westelius, Carl-Johan and Westin, Carl-Fredrik},
title = {{Biological Vision}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {41--95},
publisher = {Kluwer},
address = {Dordrecht},
}
This chapter establishes the motivation and use of hierarchical operation structures to provide a systematic organization for the implementation of complicated models. The chapter gives an intuitive treatment of most aspects that are considered in the later chapters.
@incollection{diva2:405373,
author = {Granlund, Gösta H.},
title = {{Introduction and Overview}},
booktitle = {Signal Processing for Computer Vision},
year = {1995},
pages = {1--39},
publisher = {Kluwer},
address = {Dordrecht},
}
@incollection{diva2:275350,
author = {Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta},
title = {{Low Level Focus of Attention Mechanisms}},
booktitle = {Vision as Process},
year = {1995},
publisher = {Springer},
address = {Berlin},
}
@incollection{diva2:275348,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Line Extraction using Tensors}},
booktitle = {Vision as Process},
year = {1995},
publisher = {Springer},
address = {Berlin},
}
We apply the 3D-orientation tensor representation to construct an object tracking algorithm. 2D-line normal flow is estimated by computing the eigenvector associated with the largest eigenvalue of 3D (two spatial dimensions plus time) tensors with a planar structure. The object's true 2D velocity is computed by averaging tensors with consistent normal flows, generating a 3D line representation that corresponds to a 2D point in motion. Flow induced by camera rotation is compensated for by ignoring points with velocity consistent with the ego-rotation. A region-of-interest growing process based on motion consistency generates estimates of object size and position. The literature on optical flow estimation is vast. Descriptions and performance studies of a number of different techniques are given in and the monographs by Fleet and Jahne. We will only briefly describe the particular methods used in the present study. Details on the tensor field representation a...
@incollection{diva2:275346,
author = {Karlholm, Jörgen and Westelius, Carl-Johan and Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Object Tracking Based on the Orientation Tensor Concept}},
booktitle = {Theory and Applications of Image Analysis II},
year = {1995},
pages = {267--278},
publisher = {World Scientific Publishing},
address = {Singapore},
}
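The first step described in the abstract above, estimating normal flow from the eigenvector of the largest eigenvalue of a 3D spatio-temporal tensor with planar structure, can be sketched as follows. The tensor is simply given as a symmetric 3x3 array here; how it is estimated from quadrature filters is outside this sketch, and the toy tensor at the end is constructed for illustration only.

import numpy as np

def normal_flow_from_tensor(T):
    """Given a 3x3 spatio-temporal orientation tensor T (planar structure
    assumed), return the 2D normal velocity of the moving line.

    The eigenvector e = (ex, ey, et) of the largest eigenvalue is the plane
    normal; the normal velocity is -et * (ex, ey) / (ex^2 + ey^2)."""
    vals, vecs = np.linalg.eigh(T)      # eigenvalues in ascending order
    ex, ey, et = vecs[:, -1]            # eigenvector of the largest eigenvalue
    denom = ex**2 + ey**2
    if denom < 1e-12:
        return np.zeros(2)              # purely temporal structure: undefined
    return -et * np.array([ex, ey]) / denom

# Toy tensor for a line moving with normal velocity (0.5, 0): the plane normal
# is proportional to (1, 0, -0.5) (spatial gradient direction and temporal slope).
n = np.array([1.0, 0.0, -0.5]); n /= np.linalg.norm(n)
T = np.outer(n, n)
print(normal_flow_from_tensor(T))       # approximately [0.5, 0.]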
@incollection{diva2:275344,
author = {Westin, Carl-Fredrik and Granlund, Gösta and Knutsson, Hans},
title = {{Advanced Image Processing: Introduction and Background}},
booktitle = {Vision as Process},
year = {1995},
publisher = {Springer},
address = {Berlin},
}
The problem of estimating depth information from two or more images of a scene is one which has received considerable attention over the years and a wide variety of methods have been proposed to solve it [Barnard and Fischler, 1982; Fleck, 1991]. Methods based on correlation and methods using some form of feature matching between the images have found most widespread use. Of these, the latter have attracted increasing attention since the work of Marr [Marr, 1982], in which the features are zero-crossings on varying scales. These methods share an underlying basis of spatial domain operations.
In recent years, however, increasing interest has been shown in computational models of vision based primarily on a localized frequency domain representation - the Gabor representation [Gabor, 1946; Adelson and Bergen, 1985], first suggested in the context of computer vision by Granlund [Granlund, 1978].
@incollection{diva2:274818,
author = {Westelius, Carl-Johan and Knutsson, Hans and Wiklund, Johan and Westin, Carl-Fredrik},
title = {{Phase-based Disparity Estimation}},
booktitle = {Vision as Process},
year = {1995},
pages = {157--178},
publisher = {Springer-Verlag},
address = {Berlin},
}
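The basic phase-based disparity principle underlying the chapter above is that the disparity between left and right signals is approximately the local phase difference divided by the local frequency. The 1D sketch below illustrates only that principle, using an invented complex Gabor-like kernel and a fixed nominal frequency; the chapter's actual quadrature filters and multi-scale scheme are more elaborate.

import numpy as np

def gabor_filter(x, wavelength=8.0, sigma=4.0):
    return np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * 2 * np.pi * x / wavelength)

def phase_disparity(left, right, wavelength=8.0, sigma=4.0):
    """Estimate per-sample disparity from the phase difference of quadrature
    filter responses: d ~ (phi_left - phi_right) / w0."""
    x = np.arange(-16, 17)
    g = gabor_filter(x, wavelength, sigma)
    ql = np.convolve(left, g, mode="same")
    qr = np.convolve(right, g, mode="same")
    dphi = np.angle(ql * np.conj(qr))       # wrapped phase difference
    w0 = 2 * np.pi / wavelength             # nominal local frequency
    return dphi / w0

# 1D toy example: the right signal is the left signal shifted by 2 samples.
t = np.arange(256)
left = np.sin(2 * np.pi * t / 8.0)
right = np.roll(left, 2)
d = phase_disparity(left, right)
print(np.median(d[32:-32]))                 # close to 2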
A framework for computer-aided analysis of mammograms is described. General computer vision algorithms are combined with application specific procedures in a hierarchical fashion. The system is under development and is currently limited to detection of a few types of suspicious areas. The image features are extracted by using feature extraction methods where wavelet techniques are utilized. A low-pass pyramid representation of the image is convolved with a number of quadrature filters. The filter outputs are combined according to simple local Fourier domain models into parameters describing the local neighborhood with respect to the model. This produces estimates for each pixel describing local size, orientation, Fourier phase, and shape with confidence measures associated with each parameter. Tentative object descriptions are then extracted from the pixel-based features by application-specific procedures with knowledge of relevant structures in mammograms. The orientation, relative brightness and shape of the object are obtained by selection of the pixel feature estimates which best describe the object. The list of object descriptions is examined by procedures, where each procedure corresponds to a specific type of suspicious area, e.g. clusters of microcalcifications.
@incollection{diva2:275345,
author = {Bårman, Håkan and Granlund, Gösta H. and Haglund, Leif},
title = {{Feature Extraction for Computer-Aided Analysis of Mammograms}},
booktitle = {State of the Art in Digital Mammographic Image Analysis},
year = {1994},
publisher = {World Scientific Publishing Co. Ltd},
address = {Singapore},
}
The Application Visualization System software from Advanced Visual Systems Inc. is an interactive visualization environment for scientists, engineers and technical professionals. This report contains a short overview of the AVS software package and a discussion of its general performance. The software package has been actively used at the Computer Vision Laboratory, Linköping University, during the last three years. The AVS package has been used in many applications. Examples are generating images from a virtual environment, simulation of a controllable robot with a stereo camera head, and visualization of multidimensional data structures. Lately, we have also used AVS for handling communication between different processes which may be distributed on different machines. AVS was primarily developed as a tool for visualization of complex data sets. However, another important aspect of the software is that it can be used as an advanced workbench for controlling networks of Unix processes (including external ones on different machine types) using simple visual programming.
@incollection{diva2:274829,
author = {Westelius, Carl-Johan and Wiklund, Johan and Westin, Carl-Fredrik},
title = {{Prototyping, Visualization and Simulation Using the Application Visualization System}},
booktitle = {Experimental Environments for Computer Vision and Image Processing},
year = {1994},
pages = {33--62},
publisher = {World Scientific Publishing Co. Pte. Ltd.},
address = {Singapore},
}
@incollection{diva2:275349,
author = {Haglund, Leif and Bårman, Håkan and Knutsson, Hans},
title = {{Estimation of Velocity and Acceleration in Time Sequences}},
booktitle = {Theory \& Applications of Image Analysis},
year = {1992},
pages = {223--236},
publisher = {World Scientific Publishing Co},
address = {Singapore},
}
@incollection{diva2:275343,
author = {Andersson, Mats and Knutsson, Hans},
title = {{Orientation Estimation in Ambiguous Neighbourhoods}},
booktitle = {Theory \& Applications of Image Analysis},
year = {1992},
pages = {189--210},
publisher = {World Scientific Publishing Co},
address = {Singapore},
}
@incollection{diva2:275355,
author = {Granlund, Gösta H. and Arvidsson, Jan},
title = {{The GOP Image Computer}},
booktitle = {Fundamentals in Computer Vision},
year = {1983},
publisher = {Cambridge University Press},
address = {Cambridge},
}
@incollection{diva2:275353,
author = {Granlund, Gösta H. and Knutsson, Hans},
title = {{Contrast of Structured and Homogenous Representations}},
booktitle = {Physical and Biological Processing of Images},
year = {1983},
pages = {282--303},
publisher = {Springer Verlag},
address = {Berlin},
}
@incollection{diva2:275351,
author = {Granlund, Gösta H. and Knutsson, Hans and Wilson, Roland},
title = {{Image Enhancement}},
booktitle = {Fundamentals in Computer Vision},
year = {1983},
pages = {57--68},
publisher = {Cambridge University Press},
address = {Cambridge},
}
@incollection{diva2:275339,
author = {Knutsson, Hans and Edholm, Paul and Granlund, Gösta H.},
title = {{Aspects of 3-D Reconstruction by Fourier Techniques}},
booktitle = {Digital Signal Processing},
year = {1980},
publisher = {Academic Press},
address = {London},
}
@incollection{diva2:275340,
author = {Granlund, Gösta H.},
title = {{The Use of Distribution Functions to Describe Integrated Profiles of Human Chromosomes}},
booktitle = {Chromosome Identification, Proceedings of the 23rd Nobel Symposium},
year = {1973},
publisher = {Academic Press},
address = {New York},
}
Conference papers
Image-level weakly-supervised semantic segmentation (WSSS) reduces the usually vast data annotation cost by relying on surrogate segmentation masks during training. The typical approach involves training an image classification network using global average pooling (GAP) on convolutional feature maps. This enables the estimation of object locations based on class activation maps (CAMs), which identify the importance of image regions. The CAMs are then used to generate pseudo-labels, in the form of segmentation masks, to supervise a segmentation model in the absence of pixel-level ground truth. Our work is based on two techniques for improving CAMs: importance sampling, which is a substitute for GAP, and the feature similarity loss, which utilizes a heuristic that object contours almost always align with color edges in images. However, both are based on the multinomial posterior with softmax, and implicitly assume that classes are mutually exclusive, which turns out to be suboptimal in our experiments. Thus, we reformulate both techniques based on binomial posteriors of multiple independent binary problems. This has two benefits: their performance is improved and they become more general, resulting in an add-on method that can boost virtually any WSSS method. This is demonstrated on a wide variety of baselines on the PASCAL VOC dataset, improving the region similarity and contour quality of all implemented state-of-the-art methods. Experiments on the MS COCO dataset further show that our proposed add-on is well-suited for large-scale settings. Our code implementation is available at https://github.com/arvijj/hfpl.
@inproceedings{diva2:1851761,
author = {Jonnarth, Arvi and Zhang, Yushan and Felsberg, Michael},
title = {{High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation}},
booktitle = {2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2024},
pages = {999--1008},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
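The reformulation described in the abstract above, replacing a multinomial (softmax) posterior with independent binomial posteriors per class, amounts to scoring each class with its own sigmoid and a binary cross-entropy term rather than one softmax cross-entropy over mutually exclusive classes. The sketch below only illustrates that general switch; it is not the authors' training code, and the shapes and labels are invented.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 20)                 # (batch, num_classes) image-level class scores
targets = torch.zeros(4, 20)
targets[:, [3, 7]] = 1.0                    # multi-label ground truth (classes can co-occur)

# Multinomial view: softmax assumes exactly one class is present per image.
multinomial_post = logits.softmax(dim=1)

# Binomial view: each class gets an independent Bernoulli posterior, so
# several classes can have high probability at the same time.
binomial_post = logits.sigmoid()
loss = F.binary_cross_entropy_with_logits(logits, targets)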
The field of visual object tracking is dominated by methods that combine simple tracking algorithms and ad hoc schemes. Probabilistic tracking algorithms, which are leading in other fields, are surprisingly absent from the leaderboards. We found that accounting for distance in target kinematics, exploiting detector confidence and modelling non-uniform clutter characteristics are critical for a probabilistic tracker to work in visual tracking. Previous probabilistic methods fail to address most or all of these aspects, which we believe is why they fall so far behind current state-of-the-art (SOTA) methods (there are no probabilistic trackers in the MOT17 top 100). To rekindle progress among probabilistic approaches, we propose a set of pragmatic models addressing these challenges, and demonstrate how they can be incorporated into a probabilistic framework. We present BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant and easily extendible visual tracker, achieving state-of-the-art (SOTA) on MOT17 and MOT20, without using Re-Id. Code available at https://github.com/ffi-no/paper-base-visapp-2024.
@inproceedings{diva2:1843135,
author = {Larsen, Martin and Rolfsfjord, Sigmund and Gusland, Daniel and Ahlberg, Jörgen and Mathiassen, Kim},
title = {{BASE: Probably a Better Approach to Visual Multi-Object Tracking}},
booktitle = {Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Rome, Italy, 2024},
year = {2024},
series = {VISIGRAPP},
pages = {110--121},
publisher = {SciTePress},
}
Taking a better look at subjects of interest helps humans to improve confidence in their age estimation. Unlike still images, sequences offer spatio-temporal dynamic information that contains many cues related to age progression. A review of previous work on video-based age estimation indicates that this is an underexplored field of research. This may be caused by a lack of well-defined and publicly accessible video benchmark protocol, as well as the absence of video-oriented training data. To address the former issue, we propose a carefully designed video age estimation benchmark protocol and make it publicly available. To address the latter issue, we design a video-specific age estimation method that leverages pseudo-labeling and semi-supervised learning. Our results show that the proposed method outperforms image-based baselines on both offline and online benchmark protocols, while the online estimation stability is improved by more than 50%.
@inproceedings{diva2:1843081,
author = {Be\v{s}eni\'{c}, Kre\v{s}imir and Ahlberg, Jörgen and Pandži\'{c}, Igor},
title = {{Let Me Take a Better Look: Towards Video-Based Age Estimation}},
booktitle = {Proceedings of the 13th International Conference on Pattern Recognition Applications and Methods - ICPRAM, Rome, Italy},
year = {2024},
series = {ICPRAM},
pages = {57--59},
}
Magnetic resonance imaging (MRI) is a slow diagnostic technique due to its time-consuming acquisition speed. To address this, parallel imaging and compressed sensing methods were developed. Parallel imaging acquires multiple anatomy views simultaneously, while compressed sensing acquires fewer samples than traditional methods. However, reconstructing images from undersampled multi-coil data remains challenging. Existing methods concatenate input slices and adjacent slices along the channel dimension to gather more information for MRI reconstruction. Implicit feature alignment within adjacent slices is crucial for optimal reconstruction performance. Hence, we propose MFormer: an accelerated MRI reconstruction transformer with cascading MFormer blocks containing multi-scale Dynamic Deformable Swin Transformer (DST) modules. Unlike other methods, our DST modules implicitly align adjacent slice features using dynamic deformable convolution and extract local and non-local features before merging information. We adapt to input variations by aggregating deformable convolution kernel weights and biases through a dynamic weight predictor. Extensive experiments on Stanford2D, Stanford3D, and large-scale FastMRI datasets show the merits of our contributions, achieving state-of-the-art MRI reconstruction performance. Our code and models are available at https://github.com/wafaAlghallabi/MFomer.
@inproceedings{diva2:1825526,
author = {Alghallabi, Wafa and Dudhane, Akshay and Zamir, Waqas and Khan, Salman and Khan, Fahad},
title = {{Accelerated MRI Reconstruction via Dynamic Deformable Alignment Based Transformer}},
booktitle = {MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I},
year = {2024},
series = {Lecture Notes in Computer Science},
pages = {104--114},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views. The proposed multiplane neural radiance model, named GMNR, consists of a novel α-guided view-dependent representation (α-VdR) module for learning view-dependent information. The α-VdR module, facilitated by an α-guided pixel sampling technique, computes the view-dependent representation efficiently by learning viewing direction and position coefficients. Moreover, we propose a view-consistency loss to enforce photometric similarity across multiple views. The GMNR model can generate 3D-aware high-resolution images that are view-consistent across multiple camera poses, while maintaining the computational efficiency in terms of both training and inference time. Experiments on three datasets demonstrate the effectiveness of the proposed modules, leading to favorable results in terms of both generation quality and inference time, compared to existing approaches. Our GMNR model generates 3D-aware images of 1024 x 1024 pixels with 17.6 FPS on a single V100. Code: https://github.com/VIROBO-15/GMNR
@inproceedings{diva2:1852131,
author = {Kumar, Amandeep and Bhunia, Ankan Kumar and Narayan, Sanath and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad},
title = {{Generative Multiplane Neural Radiance for 3D-Aware Image Generation}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV},
year = {2023},
series = {IEEE International Conference on Computer Vision},
pages = {7354--7364},
publisher = {IEEE COMPUTER SOC},
}
3D instance segmentation has recently garnered increased attention. Typical deep learning methods adopt point grouping schemes followed by hand-designed geometric clustering. Inspired by the success of transformers for various 3D tasks, newer hybrid approaches have utilized transformer decoders coupled with convolutional backbones that operate on voxelized scenes. However, due to the nature of sparse feature backbones, the extracted features provided to the transformer decoder are lacking in spatial understanding. Thus, such approaches often predict spatially separate objects as single instances. To this end, we introduce a novel approach for 3D point cloud instance segmentation that addresses the challenge of generating distinct instance masks for objects that share similar appearances but are spatially separated. Our method leverages spatial and semantic supervision with query refinement to improve the performance of hybrid 3D instance segmentation models. Specifically, we provide the transformer block with spatial features to facilitate differentiation between similar object queries and incorporate semantic supervision to enhance prediction accuracy based on object class. Our proposed approach outperforms existing methods on the validation sets of ScanNet V2 and ScanNet200 datasets, establishing a new state-of-the-art for this task.
@inproceedings{diva2:1852130,
author = {Al Khatib, Salwa and Boudjoghra, Mohamed El Amine and Lahoud, Jean and Khan, Fahad},
title = {{3D Instance Segmentation via Enhanced Spatial and Semantic Supervision}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV},
year = {2023},
series = {IEEE International Conference on Computer Vision},
pages = {541--550},
publisher = {IEEE COMPUTER SOC},
}
Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets.
@inproceedings{diva2:1852001,
author = {Wasim, Syed Talal and Khattak, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad},
title = {{Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023)},
year = {2023},
series = {IEEE International Conference on Computer Vision},
pages = {13732--13743},
publisher = {IEEE COMPUTER SOC},
}
Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach by: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization. We perform extensive experiments on 4 benchmarks where PromptSRC overall performs favorably well compared to the existing methods. Our code and pre-trained models are publicly available at: https://github.com/muzairkhattak/PromptSRC.
@inproceedings{diva2:1851995,
author = {Khattak, Muhammad Uzair and Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad},
title = {{Self-regulating Prompts: Foundational Model Adaptation without Forgetting}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023)},
year = {2023},
series = {IEEE International Conference on Computer Vision},
pages = {15144--15154},
publisher = {IEEE COMPUTER SOC},
}
Ensembles of independently trained deep neural networks yield uncertainty estimates that rival Bayesian networks in performance. They also offer sizable improvements in terms of predictive performance over single models. However, deep ensembles are not commonly used in environments with a limited computational budget - such as autonomous driving - since the complexity grows linearly with the number of ensemble members. An important observation that can be made for robotics applications, such as autonomous driving, is that data is typically sequential. For instance, when an object is to be recognized, an autonomous vehicle typically observes a sequence of images, rather than a single image. This raises the question: could the deep ensemble be spread over time? In this work, we propose and analyze Deep Ensembles Spread Over Time (DESOT). The idea is to apply only a single ensemble member to each data point in the sequence, and fuse the predictions over a sequence of data points. We implement and experiment with DESOT for traffic sign classification, where sequences of tracked image patches are to be classified. We find that DESOT obtains the benefits of deep ensembles, in terms of predictive and uncertainty estimation performance, while avoiding the added computational cost. Moreover, DESOT is simple to implement and does not require sequences during training. Finally, we find that DESOT, like deep ensembles, outperforms single models for out-of-distribution detection.
@inproceedings{diva2:1847538,
author = {Meding, Isak and Bodin, Alexander and Tonderski, Adam and Johnander Fax\'{e}n, Joakim and Petersson, Christoffer and Svensson, Lennart},
title = {{You can have your ensemble and run it too - Deep Ensembles Spread Over Time}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW},
year = {2023},
series = {IEEE International Conference on Computer Vision Workshops},
pages = {4022--4031},
publisher = {IEEE COMPUTER SOC},
}
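The core mechanism in the DESOT abstract above, applying one ensemble member per frame of a sequence and fusing the predictions over time, can be sketched in a few lines. Everything here (cycling through members, fusing by averaging softmax probabilities) is an assumption chosen for illustration; consult the paper for the actual fusion rule.

import torch

def desot_predict(models, frames):
    """Apply ensemble member i to frame i (cycling through the members) and
    average the class probabilities over the sequence."""
    probs = []
    for i, frame in enumerate(frames):
        model = models[i % len(models)]     # one ensemble member per time step
        with torch.no_grad():
            probs.append(model(frame.unsqueeze(0)).softmax(dim=1))
    return torch.stack(probs).mean(dim=0)   # fused prediction for the whole track

# Hypothetical usage: `models` is a list of independently trained classifiers,
# `frames` a sequence of tracked image patches for one object.
# fused = desot_predict(models, frames)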
The Visual Object Tracking Segmentation VOTS2023 challenge is the eleventh annual tracker benchmarking activity of the VOT initiative. This challenge is the first to merge short-term and long-term as well as single-target and multiple-target tracking with segmentation masks as the only target location specification. A new dataset was created; the ground truth has been withheld to prevent overfitting. New performance measures and evaluation protocols have been created along with a new toolkit and an evaluation server. Results of the presented 47 trackers indicate that modern tracking frameworks are well-suited to deal with convergence of short-term and long-term tracking and that multiple and single target tracking can be considered a single problem. A leaderboard, with participating trackers details, the source code, the datasets, and the evaluation kit are publicly available at the challenge website(1).
@inproceedings{diva2:1847535,
author = {Kristan, Matej and Matas, Jiri and Danelljan, Martin and Felsberg, Michael and Chang, Hyung Jin and Zajc, Luka Cehovin and Lukezic, Alan and Drbohlav, Ondrej and Zhang, Zhongqun and Tran, Khanh-Tung and Vu, Xuan-Son and Bjorklund, Johanna and Mayer, Christoph and Zhang, Yushan and Ke, Lei and Zhao, Jie and Fernandez, Gustavo and Al-Shakarji, Noor and An, Dong and Arens, Michael and Becker, Stefan and Bhat, Goutam and Bullinger, Sebastian and Chan, Antoni B. and Chang, Shijie and Chen, Hanyuan and Chen, Xin and Chen, Yan and Chen, Zhenyu and Cheng, Yangming and Cui, Yutao and Deng, Chunyuan and Dong, Jiahua and Dunnhofer, Matteo and Feng, Wei and Fu, Jianlong and Gao, Jie and Han, Ruize and Hao, Zeqi and He, Jun-Yan and He, Keji and He, Zhenyu and Hu, Xiantao and Huang, Kaer and Huang, Yuqing and Jiang, Yi and Kang, Ben and Lan, Jin-Peng and Lee, Hyungjun and Li, Chenyang and Li, Jiahao and Li, Ning and Li, Wangkai and Li, Xiaodi and Li, Xin and Liu, Pengyu and Liu, Yue and Lu, Huchuan and Luo, Bin and Luo, Ping and Ma, Yinchao and Miao, Deshui and Micheloni, Christian and Palaniappan, Kannappan and Park, Hancheol and Paul, Matthieu and Peng, HouWen and Qian, Zekun and Rahmon, Gani and Scherer-Negenborn, Norbert and Shao, Pengcheng and Shin, Wooksu and Kazemi, Elham Soltani and Song, Tianhui and Stiefelhagen, Rainer and Sun, Rui and Tang, Chuanming and Tang, Zhangyong and Toubal, Imad Eddine and Valmadre, Jack and van de Weijer, Joost and Van Gool, Luc and Vira, Jash and Vujasinovic, Stephane and Wan, Cheng and Wan, Jia and Wang, Dong and Wang, Fei and Wang, Feifan and Wang, He and Wang, Limin and Wang, Song and Wang, Yaowei and Wang, Zhepeng and Wu, Gangshan and Wu, Jiannan and Wu, Qiangqiang and Wu, Xiaojun and Xiao, Anqi and Xie, Jinxia and Xu, Chenlong and Xu, Min and Xu, Tianyang and Xu, Yuanyou and Yan, Bin and Yang, Dawei and Yang, Ming-Hsuan and Yang, Tianyu and Yang, Yi and Yang, Zongxin and Yin, Xuanwu and Yu, Fisher and Yu, Hongyuan and Yu, Qianjin and Yu, Weichen and Yuan, YongSheng and Yuan, Zehuan and Zhang, Jianlin and Zhang, Lu and Zhang, Tianzhu and Zhao, Guodongfang and Zhao, Shaochuan and Zheng, Yaozong and Zhong, Bineng and Zhu, Jiawen and Zhu, Xuefeng and Zhuang, Yueting and Zong, ChengAo and Zuo, Kunlong},
title = {{The First Visual Object Tracking Segmentation VOTS2023 Challenge Results}},
booktitle = {2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW},
year = {2023},
series = {IEEE International Conference on Computer Vision Workshops},
pages = {1788--1810},
publisher = {IEEE COMPUTER SOC},
}
In this work, we propose a few-shot colorectal tissue image generation method for addressing the scarcity of histopathological training data for rare cancer tissues. Our few-shot generation method, named XM-GAN, takes one base and a pair of reference tissue images as input and generates high-quality yet diverse images. Within our XM-GAN, a novel controllable fusion block densely aggregates local regions of reference images based on their similarity to those in the base image, resulting in locally consistent features. To the best of our knowledge, we are the first to investigate few-shot generation in colorectal tissue images. We evaluate our few-shot colorectal tissue image generation by performing extensive qualitative, quantitative and subject specialist (pathologist) based evaluations. Specifically, in specialist-based evaluation, pathologists could differentiate between our XM-GAN generated tissue images and real images only 55% of the time. Moreover, we utilize these generated images as data augmentation to address the few-shot tissue image classification task, achieving a gain of 4.4% in terms of mean accuracy over the vanilla few-shot classifier. Code: https://github.com/VIROBO-15/XM-GAN.
@inproceedings{diva2:1830191,
author = {Kumar, Amandeep and Bhunia, Ankan Kumar and Narayan, Sanath and Cholakkal, Hisham and Anwer, Rao Muhammad and Laaksonen, Jorma and Khan, Fahad},
title = {{Cross-Modulated Few-Shot Image Generation for Colorectal Tissue Classification}},
booktitle = {MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT III},
year = {2023},
series = {Lecture Notes in Computer Science},
pages = {128--137},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
Accurate 3D mitochondria instance segmentation in electron microscopy (EM) is a challenging problem and serves as a prerequisite to empirically analyze their distributions and morphology. Most existing approaches employ 3D convolutions to obtain representative features. However, these convolution-based approaches struggle to effectively capture long-range dependencies in the volume mitochondria data, due to their limited local receptive field. To address this, we propose a hybrid encoder-decoder framework based on a split spatio-temporal attention module that efficiently computes spatial and temporal self-attentions in parallel, which are later fused through a deformable convolution. Further, we introduce a semantic foreground-background adversarial loss during training that aids in delineating the region of mitochondria instances from the background clutter. Our extensive experiments on three benchmarks, Lucchi, MitoEM-R and MitoEM-H, reveal the benefits of the proposed contributions achieving state-of-the-art results on all three datasets. Our code and models are available at https://github.com/OmkarThawakar/STT-UNET.
@inproceedings{diva2:1827916,
author = {Thawakar, Omkar and Anwer, Rao Muhammad and Laaksonen, Jorma and Reiner, Orly and Shah, Mubarak and Khan, Fahad},
title = {{3D Mitochondria Instance Segmentation with Spatio-Temporal Transformers}},
booktitle = {MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VIII},
year = {2023},
series = {Lecture Notes in Computer Science},
pages = {613--623},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
Detecting breast lesions in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores the useful local information. To tackle these issues, we propose a spatial-temporal deformable attention based framework, named STNet. Our STNet introduces a spatial-temporal deformable attention module to perform local spatial-temporal feature fusion. The spatial-temporal deformable attention module enables deep feature aggregation in each stage of both encoder and decoder. To further accelerate the detection speed, we introduce an encoder feature shuffle strategy for multi-frame prediction during inference. In our encoder feature shuffle strategy, we share the backbone and encoder features, and shuffle encoder features for the decoder to generate the predictions of multiple frames. The experiments on the public breast lesion ultrasound video dataset show that our STNet obtains a state-of-the-art detection performance, while operating at twice the inference speed. The code and model are available at https://github.com/AlfredQin/STNet.
@inproceedings{diva2:1827911,
author = {Qin, Chao and Cao, Jiale and Fu, Huazhu and Anwer, Rao Muhammad and Khan, Fahad},
title = {{A Spatial-Temporal Deformable Attention Based Framework for Breast Lesion Detection in Videos}},
booktitle = {MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT II},
year = {2023},
series = {Lecture Notes in Computer Science},
pages = {479--488},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
It is imperative to ensure the robustness of deep learning models in critical applications such as healthcare. While recent advances in deep learning have improved the performance of volumetric medical image segmentation models, these models cannot be deployed for real-world applications immediately due to their vulnerability to adversarial attacks. We present a 3D frequency domain adversarial attack for volumetric medical image segmentation models and demonstrate its advantages over conventional input or voxel domain attacks. Using our proposed attack, we introduce a novel frequency domain adversarial training approach for optimizing a robust model against voxel and frequency domain attacks. Moreover, we propose a frequency consistency loss to regulate our frequency domain adversarial training that achieves a better tradeoff between the model's performance on clean and adversarial samples. Code is available at https://github.com/asif-hanif/vafa.
@inproceedings{diva2:1827822,
author = {Hanif, Asif and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad},
title = {{Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation}},
booktitle = {MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT II},
year = {2023},
series = {Lecture Notes in Computer Science},
pages = {457--467},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
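To make the notion of a frequency domain attack above concrete, the sketch below perturbs the Fourier coefficients of a 3D volume and transforms back to the voxel domain. This is a generic, random frequency-domain perturbation for illustration only, not the paper's optimized adversarial attack or its frequency consistency loss; the volume, epsilon, and value range are invented.

import torch

def frequency_perturb(volume, epsilon=0.05):
    """Generic frequency-domain perturbation of a 3D volume: perturb the
    Fourier coefficients, transform back, and keep values in a valid range."""
    spec = torch.fft.fftn(volume, dim=(-3, -2, -1))
    noise = torch.randn_like(spec.real) + 1j * torch.randn_like(spec.real)
    adv = torch.fft.ifftn(spec + epsilon * noise * spec.abs(), dim=(-3, -2, -1)).real
    return adv.clamp(0.0, 1.0)              # toy volume assumed to lie in [0, 1]

vol = torch.rand(1, 64, 64, 64)             # toy CT/MRI-like volume
adv_vol = frequency_perturb(vol)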
Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Fine-tuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes/models will be released at https://github.com/TalalWasim/Vita-CLIP.
@inproceedings{diva2:1815364,
author = {Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Khan, Fahad and Shah, Mubarak},
title = {{Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {23034--23044},
publisher = {IEEE COMPUTER SOC},
}
Burst image processing has become increasingly popular in recent years. However, it is a challenging task since individual burst images undergo multiple degradations and often have mutual misalignments resulting in ghosting and zipper artifacts. Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which tends to limit these approaches in challenging cases. Another key challenge lies in the robust up-sampling of burst frames. The existing up-sampling methods cannot effectively utilize the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers at the same time. To address these challenges, we propose a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality raw images. GMTNet consists of three modules optimized for burst processing tasks: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct a high-quality output image. Detailed experimental analysis on five datasets validates our approach and sets a new state-of-the-art for burst super-resolution, burst denoising, and low-light burst enhancement. Our codes and models are available at https://github.com/nanmehta/GMTNet.
@inproceedings{diva2:1815361,
author = {Mehta, Nancy and Dudhane, Akshay and Murala, Subrahmanyam and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad},
title = {{Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {22201--22210},
publisher = {IEEE COMPUTER SOC},
}
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning.
@inproceedings{diva2:1815359,
author = {Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad},
title = {{MaPLe: Multi-modal Prompt Learning}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {19113--19122},
publisher = {IEEE COMPUTER SOC},
}
Many real-world recognition problems are characterized by long-tailed label distributions. These distributions make representation learning highly challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. A recent line of work proposes learning multiple diverse experts to tackle this issue. Ensemble diversity is encouraged by various techniques, e.g. by specializing different experts in the head and the tail classes. In this work, we take an analytical approach and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE). BalPoE combines a family of experts with different test-time target distributions, generalizing several previous approaches. We show how to properly define these distributions and combine the experts in order to achieve unbiased predictions, by proving that the ensemble is Fisher-consistent for minimizing the balanced error. Our theoretical analysis shows that our balanced ensemble requires calibrated experts, which we achieve in practice using mixup. We conduct extensive experiments and our method obtains new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018. Our code is available at https://github.com/emasa/BalPoE-CalibratedLT.
@inproceedings{diva2:1815355,
author = {Sanchez Aimar, Emanuel and Jonnarth, Arvi and Felsberg, Michael and Kuhlmann, Marco},
title = {{Balanced Product of Calibrated Experts for Long-Tailed Recognition}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {19967--19977},
publisher = {IEEE COMPUTER SOC},
}
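The extension of logit adjustment to an ensemble, as described in the BalPoE abstract above, can be roughly illustrated as follows: each expert's logits are shifted by a term involving the log class prior of its own target distribution, and the adjusted experts are combined. The sketch below is a simplified reading with hypothetical priors and tau values; the paper's exact parameterization of the target distributions, the Fisher-consistency argument, and the mixup-based calibration are not reproduced here.

import torch

def adjusted_logits(logits, class_prior, tau=1.0):
    """Logit adjustment: shift logits by tau * log(prior) so that the implied
    posterior is re-balanced towards a chosen target distribution."""
    return logits + tau * torch.log(class_prior)

def ensemble_logits(expert_logits, class_prior, taus):
    """Average experts whose logits are adjusted with different tau values,
    i.e. different test-time target distributions (a simplified reading of a
    balanced product of experts)."""
    adjusted = [adjusted_logits(l, class_prior, t) for l, t in zip(expert_logits, taus)]
    return torch.stack(adjusted).mean(dim=0)

# Toy usage with an invented long-tailed prior and three experts.
prior = torch.tensor([0.7, 0.2, 0.1])
experts = [torch.randn(8, 3) for _ in range(3)]
logits = ensemble_logits(experts, prior, taus=[0.0, 1.0, 2.0])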
Deep neural networks (DNNs) have enabled astounding progress in several vision-based problems. Despite showing high predictive accuracy, recently, several works have revealed that they tend to provide overconfident predictions and thus are poorly calibrated. The majority of the works addressing the miscalibration of DNNs fall under the scope of classification and consider only in-domain predictions. However, there is little to no progress in studying the calibration of DNN-based object detection models, which are central to many vision-based safety-critical applications. In this paper, inspired by the train-time calibration methods, we propose a novel auxiliary loss formulation that explicitly aims to align the class confidence of bounding boxes with the accurateness of predictions (i.e. precision). Since the original formulation of our loss depends on the counts of true positives and false positives in a mini-batch, we develop a differentiable proxy of our loss that can be used during training with other application-specific loss functions. We perform extensive experiments on challenging in-domain and out-domain scenarios with six benchmark datasets including MS-COCO, Cityscapes, Sim10k, and BDD100k. Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios. Our source code and pre-trained models are available at https://github.com/akhtarvision/bpc_calibration
@inproceedings{diva2:1813979,
author = {Munir, Muhammad Akhtar and Khan, Muhammad Haris and Khan, Salman and Khan, Fahad},
title = {{Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {11474--11483},
publisher = {IEEE COMPUTER SOC},
}
Recent advances in 3D-aware generative models (3D-aware GANs) combined with Neural Radiance Fields (NeRF) have achieved impressive results. However, no prior work investigates 3D-aware GANs for 3D consistent multi-class image-to-image (3D-aware I2I) translation. Naively using 2D-I2I translation methods suffers from unrealistic shape/identity change. To perform 3D-aware multi-class I2I translation, we decouple this learning process into a multi-class 3D-aware GAN step and a 3D-aware I2I translation step. In the first step, we propose two novel techniques: a new conditional architecture and an effective training strategy. In the second step, based on the well-trained multi-class 3D-aware GAN architecture that preserves view-consistency, we construct a 3D-aware I2I translation system. To further reduce the view-consistency problems, we propose several new techniques, including a U-net-like adaptor network design, a hierarchical representation constraint and a relative regularization loss. In extensive experiments on two datasets, quantitative and qualitative results demonstrate that we successfully perform 3D-aware I2I translation with multi-view consistency. Code is available in 3DI2I.
@inproceedings{diva2:1813978,
author = {Li, Senmao and van de Weijer, Joost and Wang, Yaxing and Khan, Fahad and Liu, Meiqin and Yang, Jian},
title = {{3D-Aware Multi-Class Image-to-Image Translation with NeRFs}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {12652--12662},
publisher = {IEEE COMPUTER SOC},
}
Although existing semi-supervised learning models achieve remarkable success in learning with unannotated in-distribution data, they mostly fail to learn on unlabeled data sampled from novel semantic classes due to their closed-set assumption. In this work, we target a pragmatic but under-explored Generalized Novel Category Discovery (GNCD) setting. The GNCD setting aims to categorize unlabeled training data coming from known and novel classes by leveraging the information of partially labeled known classes. We propose a two-stage Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem. Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts. First, we propose a discriminative prompt regularization loss to reinforce semantic discriminativeness of prompt-adapted pre-trained vision transformer for refined affinity relationships. Besides, we propose contrastive affinity learning to calibrate semantic representations based on our iterative semi-supervised affinity graph generation method for semantically-enhanced supervision. Extensive experimental evaluation demonstrates that our PromptCAL method is more effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on generic and fine-grained benchmarks (e.g., with nearly 11% gain on CUB-200, and 9% on ImageNet-100) on overall accuracy. Our code is available at https://github.com/sheng-eatamath/PromptCAL.
@inproceedings{diva2:1811916,
author = {Zhang, Sheng and Khan, Salman and Shen, Zhiqiang and Naseer, Muzammal and Chen, Guangyi and Khan, Fahad},
title = {{PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {3479--3488},
publisher = {IEEE COMPUTER SOC},
}
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack generalization. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from the CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a bridge and prompt approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and pre-trained models are available at https://github.com/muzairkhattak/ViFi-CLIP.
@inproceedings{diva2:1811912,
author = {Rasheed, Hanoona and Khattak, Muhammad Uzair and Maaz, Muhammad and Khan, Salman and Khan, Fahad},
title = {{Fine-tuned CLIP Models are Efficient Video Learners}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {6545--6554},
publisher = {IEEE COMPUTER SOC},
}
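The frame-level recipe described in the ViFi-CLIP abstract above (encode each frame with the CLIP image encoder, pool the frame features, and match the pooled video feature against text embeddings) can be sketched generically. Here image_encoder, text_features, and the temperature are placeholders for illustration, not the actual ViFi-CLIP interfaces.

import torch
import torch.nn.functional as F

def video_logits(image_encoder, frames, text_features, temperature=0.01):
    """frames: (T, 3, H, W) video clip; text_features: (C, D) class embeddings.

    Encode the frames independently, average-pool over time, and score classes
    by cosine similarity, mirroring the frame-level CLIP recipe described above."""
    with torch.no_grad():
        frame_feats = image_encoder(frames)                # (T, D) per-frame features
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)
    text_feats = F.normalize(text_features, dim=-1)
    return video_feat @ text_feats.t() / temperature       # (C,) class logits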
Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination. Specifically, we first propose a region-to-region correlation module for introducing inter-image relations to pixel-wise segmentation features while maintaining computational efficiency. Then, we use two types of pre-defined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token correlation and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens. We perform iterative mutual promotion for the segmentation feature extraction and token construction. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method. The source code is available at: https://github.com/dragonlee258079/DMT.
@inproceedings{diva2:1811899,
author = {Li, Long and Han, Junwei and Zhang, Ni and Liu, Nian and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad},
title = {{Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {7247--7256},
publisher = {IEEE COMPUTER SOC},
}
On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. However, individual frames in a burst are misaligned due to inevitable motions and contain multiple degradations. The challenge is to properly align the successive image shots and merge their complementary information to achieve high-quality outputs. Towards this direction, we propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement. In comparison to existing works, our approach exploits multi-scale local and non-local features to achieve improved alignment and feature fusion. Our key idea is to enable inter-frame communication in the burst neighborhood for information aggregation and progressive fusion while modeling the burst-wide context. However, the input burst frames need to be properly aligned before fusing their information. Therefore, we propose an enhanced deformable alignment module for aligning burst features with regard to the reference frame. Unlike existing methods, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism, which facilitates handling complex motions. After multi-level alignment and enrichment, we re-emphasize inter-frame communication within the burst using a cyclic burst sampling module. Finally, the inter-frame information is aggregated using the proposed burst feature fusion module followed by progressive upsampling. Our Burstormer outperforms state-of-the-art methods on burst super-resolution, burst denoising and burst low-light enhancement. Our codes and pre-trained models are available at https://github.com/akshaydudhane16/Burstormer.
@inproceedings{diva2:1811866,
author = {Dudhane, Akshay and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad and Yang, Ming-Hsuan},
title = {{Burstormer: Burst Image Restoration and Enhancement Transformer}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {5703--5712},
publisher = {IEEE COMPUTER SOC},
}
The pose-guided person image generation task requires synthesizing photorealistic images of humans in arbitrary poses. The existing approaches use generative adversarial networks that do not necessarily maintain realistic textures or need dense correspondences that struggle to handle complex deformations and severe occlusions. In this work, we show how denoising diffusion models can be applied for high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution. Our proposed Person Image Diffusion Model (PIDM) disintegrates the complex transfer problem into a series of simpler forward-backward denoising steps. This helps in learning plausible source-to-target transformation trajectories that result in faithful textures and undistorted appearance details. We introduce a texture diffusion module based on cross-attention to accurately model the correspondences between appearance and pose information available in source and target images. Further, we propose disentangled classifier-free guidance to ensure close resemblance between the conditional inputs and the synthesized output in terms of both pose and appearance information. Our extensive results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios. We also show how our generated images can help in downstream tasks. Code is available at https://github.com/ankanbhunia/PIDM.
@inproceedings{diva2:1811861,
author = {Bhunia, Ankan Kumar and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Laaksonen, Jorma and Shah, Mubarak and Khan, Fahad},
title = {{Person Image Synthesis via Denoising Diffusion Model}},
booktitle = {2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR},
year = {2023},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {5968--5976},
publisher = {IEEE COMPUTER SOC},
}
Ensembles of independently trained deep neural networks yield uncertainty estimates that rival Bayesian networks in performance. They also offer sizable improvements in terms of predictive performance over single models. However, deep ensembles are not commonly used in environments with limited computational budget, such as autonomous driving, since the complexity grows linearly with the number of ensemble members. An important observation that can be made for robotics applications, such as autonomous driving, is that data is typically sequential. For instance, when an object is to be recognized, an autonomous vehicle typically observes a sequence of images, rather than a single image. This raises the question: could the deep ensemble be spread over time? In this work, we propose and analyze Deep Ensembles Spread Over Time (DESOT). The idea is to apply only a single ensemble member to each data point in the sequence, and fuse the predictions over a sequence of data points. We implement and experiment with DESOT for traffic sign classification, where sequences of tracked image patches are to be classified. We find that DESOT obtains the benefits of deep ensembles, in terms of predictive and uncertainty estimation performance, while avoiding the added computational cost. Moreover, DESOT is simple to implement and does not require sequences during training. Finally, we find that DESOT, like deep ensembles, outperforms single models for out-of-distribution detection.
@inproceedings{diva2:1810786,
author = {Meding, Isak and Bodin, Alexander and Tonderski, Adam and Johnander, Joakim and Petersson, Christoffer and Svensson, Lennart},
title = {{You can have your ensemble and run it too -- Deep Ensembles Spread Over Time}},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
year = {2023},
pages = {4020--4029},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
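A minimal sketch of the ensemble-spread-over-time idea from the DESOT abstract above: one ensemble member is applied per frame of a tracked sequence and the per-frame class probabilities are fused by averaging, so the per-frame cost stays that of a single model. The round-robin member assignment and the averaging fusion rule are assumptions for illustration.

import torch
import torch.nn.functional as F

def desot_predict(models, sequence):
    """models: list of M independently trained classifiers.
    sequence: (T, C, H, W) image patches of one tracked object.
    Applies member t % M to frame t and averages the class probabilities,
    so the per-frame cost equals that of a single model."""
    probs = []
    for t, frame in enumerate(sequence):
        member = models[t % len(models)]
        with torch.no_grad():
            probs.append(F.softmax(member(frame.unsqueeze(0)), dim=-1))
    return torch.cat(probs, dim=0).mean(dim=0)  # fused class distribution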
We propose to leverage optical flow features for higher generalization power in semi-supervised video object segmentation. Optical flow is usually exploited as additional guidance information in many computer vision tasks. However, its use in video object segmentation has mainly been limited to unsupervised settings or to warping or refining the previously predicted masks. Different from the latter, we propose to directly leverage the optical flow features in the target representation. We show that this enriched representation improves the encoder-decoder approach to the segmentation task. A model to extract the combined information from the optical flow and the image is proposed, which is then used as input to the target model and the decoder network. Unlike previous methods, e.g. in tracking where concatenation is used to integrate information from image data and optical flow, a simple yet effective attention mechanism is exploited in our work. Experiments on DAVIS 2017 and YouTube-VOS 2019 show that integrating the information extracted from optical flow into the original image branch results in a strong performance gain, especially for unseen classes, which demonstrates its higher generalization power.
@inproceedings{diva2:1810690,
author = {Zhang, Yushan and Robinson, Andreas and Magnusson, Maria and Felsberg, Michael},
title = {{Leveraging Optical Flow Features for Higher Generalization Power in Video Object Segmentation}},
booktitle = {2023 IEEE International Conference on Image Processing},
year = {2023},
pages = {326--330},
publisher = {IEEE},
}
Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.
@inproceedings{diva2:1809798,
author = {Ljungbergh, William and Johnander, Joakim and Petersson, Christoffer and Felsberg, Michael},
title = {{Raw or Cooked? Object Detection on RAW Images}},
booktitle = {Image Analysis},
year = {2023},
series = {Lecture Notes in Computer Science},
volume = {13885},
pages = {374--385},
publisher = {Springer},
}
Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the most likely 3D pose for a given input image. However, a single image can be highly ambiguous and induces multiple plausible solutions for the 2D-3D lifting step, which results in overly confident 3D pose predictors. To this end, we propose DiffPose, a conditional diffusion model that predicts multiple hypotheses for a given input image. Compared to similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle the problem of over-simplification of the intermediate representation in the common two-step approaches, which first estimate a distribution of 2D joint locations via joint-wise heatmaps and subsequently use their argmax for the 3D pose estimation step. Since such a simplification of the heatmaps discards valid information about joint locations that may be correct despite being assigned a low likelihood, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples, we introduce our embedding transformer which conditions the diffusion model. Experimentally, we show that DiffPose improves upon the state of the art for multi-hypothesis pose estimation by 3-5% for simple poses and outperforms it by a large margin for highly ambiguous poses.
@inproceedings{diva2:1806371,
author = {Holmquist, Karl and Wandt, Bastian},
title = {{Diffpose: Multi-hypothesis human pose estimation using diffusion models}},
booktitle = {ICCV 2023, Paris, France, October 4-6, 2023.},
year = {2023},
}
The present work proposes the use of point cloud differential entropy as a method for reverse engineering quality assessment. This quality assessment can be used to measure the deviation of objects made with additive manufacturing or CNC techniques. The quality of the execution is intended as a measure of how much the geometry of the obtained object deviates from the original CAD model. This paper proposes the use of the quality index of the CorAl method to assess the quality of an object compared to its original CAD. This index, based on differential entropy, approaches 0 as the obtained object approaches the original geometry. The advantage of this method is to have a global synthetic index. It is, however, possible to produce entropy maps of the individual points to identify the areas with the greatest deviation. The method is robust for comparing point clouds of different densities. Objects obtained by additive manufacturing with different print qualities were used. The quality index evaluated for each object, as defined in the CorAl method, turns out to be progressively closer to 0 as the quality of the piece's construction increases.
@inproceedings{diva2:1806262,
author = {Barberi, Emmanuele and Cucinotta, Filippo and Forss\'{e}n, Per-Erik and Raffaele, Marcello and Salmeri, Fabio},
title = {{A differential entropy-based method for reverse engineering quality assessment}},
booktitle = {ADM 2023 International Conference, Florence, Italy 6-8 September 2023},
year = {2023},
}
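A sketch of the entropy computation underlying the quality index discussed above, assuming the point clouds are given as NumPy arrays of shape (N, 3): each point gets the differential entropy of a local Gaussian fit, h = 0.5 ln((2 pi e)^3 det Sigma), and a CorAl-style index compares the entropies of the joint cloud with those of the separate clouds. The neighbourhood radius, the minimum neighbour count and the final aggregation are illustrative choices, not the exact procedure of the paper.

import numpy as np
from scipy.spatial import cKDTree

def point_entropies(cloud, radius=0.01):
    """Per-point differential entropy of a local Gaussian fit:
    h = 0.5 * ln((2*pi*e)^3 * det(Sigma)), Sigma estimated from neighbours within radius."""
    tree = cKDTree(cloud)
    ent = np.full(len(cloud), np.nan)
    for i, p in enumerate(cloud):
        idx = tree.query_ball_point(p, radius)
        if len(idx) >= 5:                       # need enough points for a stable covariance
            cov = np.cov(cloud[idx].T) + 1e-12 * np.eye(3)
            ent[i] = 0.5 * np.log(((2 * np.pi * np.e) ** 3) * np.linalg.det(cov))
    return ent

def quality_index(cad_points, scanned_points, radius=0.01):
    """CorAl-style index: joint-cloud entropy minus the mean of the separate entropies,
    approaching 0 when the scanned object matches the CAD geometry."""
    joint = point_entropies(np.vstack([cad_points, scanned_points]), radius)
    sep = np.concatenate([point_entropies(cad_points, radius),
                          point_entropies(scanned_points, radius)])
    return np.nanmean(joint) - np.nanmean(sep)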
A challenge in image based metrology and forensics is intrinsic camera calibration when the used camera is unavailable. The unavailability raises two questions. The first question is how to find the projection model that describes the camera, and the second is how to detect incorrect models. In this work, we use off-the-shelf extended PnP-methods to find the model from 2D-3D correspondences, and propose a method for model validation. The most common strategy for evaluating a projection model is comparing different models' residual variances; however, this naive strategy cannot distinguish whether the projection model is potentially underfitted or overfitted. To this end, we model the residual errors for each correspondence, individually scale all residuals using a predicted variance and test if the new residuals are drawn from a standard normal distribution. We demonstrate the effectiveness of our proposed validation in experiments on synthetic data, simulating 2D detection and Lidar measurements. Additionally, we provide experiments using data from an actual scene and compare non-camera access and camera access calibrations. Last, we use our method to validate annotations in MegaDepth.
@inproceedings{diva2:1806260,
author = {Brissman, Emil and Forss\'{e}n, Per-Erik and Edstedt, Johan},
title = {{Camera Calibration Without Camera Access - A Robust Validation Technique for Extended PnP Methods}},
booktitle = {22nd Scandinavian Conference, SCIA 2023 Sirkka, Finland, April 18--21, 2023},
year = {2023},
series = {Lecture Notes in Computer Science},
volume = {13885},
pages = {34--49},
}
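The validation idea above, standardizing per-correspondence residuals by a predicted variance and testing for standard normality, can be sketched as follows; the Kolmogorov-Smirnov test used here is a stand-in, as the paper's exact test procedure may differ.

import numpy as np
from scipy import stats

def validate_projection_model(residuals, predicted_std, alpha=0.05):
    """residuals: (N, 2) reprojection errors; predicted_std: (N,) or (N, 2)
    per-correspondence standard deviations predicted by the noise model.
    Scales each residual by its predicted std and tests whether the
    standardized residuals follow a standard normal distribution."""
    z = (residuals / np.asarray(predicted_std).reshape(len(residuals), -1)).ravel()
    statistic, p_value = stats.kstest(z, 'norm')   # compare against N(0, 1)
    return p_value > alpha, p_value                # True if the model is not rejected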
Feature matching is a challenging computer vision task that involves finding correspondences between two images of a 3D scene. In this paper we consider the dense approach instead of the more common sparse paradigm, thus striving to find all correspondences. Perhaps counter-intuitively, dense methods have previously shown inferior performance to their sparse and semi-sparse counterparts for estimation of two-view geometry. This changes with our novel dense method, which outperforms both dense and sparse methods on geometry estimation. The novelty is threefold: First, we propose a kernel regression global matcher. Secondly, we propose warp refinement through stacked feature maps and depthwise convolution kernels. Thirdly, we propose learning dense confidence through consistent depth and a balanced sampling approach for dense confidence maps. Through extensive experiments we confirm that our proposed dense method, Dense Kernelized Feature Matching, sets a new state-of-the-art on multiple geometry estimation benchmarks. In particular, we achieve an improvement on MegaDepth-1500 of +4.9 and +8.9 AUC@5° compared to the best previous sparse method and dense method respectively. Our code is provided at the following repository: https://github.com/Parskatt/DKM.
@inproceedings{diva2:1795945,
author = {Edstedt, Johan and Athanasiadis, Ioannis and Wadenbäck, Mårten and Felsberg, Michael},
title = {{DKM: Dense Kernelized Feature Matching for Geometry Estimation}},
booktitle = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
series = {Proceedings: IEEE Conference on Computer Vision and Pattern Recognition},
pages = {17765--17775},
publisher = {IEEE Computer Society},
}
Industrial defect detection is commonly addressed with anomaly detection (AD) methods where no or only incomplete data of potentially occurring defects is available. This work discovers previously unknown problems of student-teacher approaches for AD and proposes a solution, where two neural networks are trained to produce the same output for the defect-free training examples. The core assumption of student-teacher networks is that the distance between the outputs of both networks is larger for anomalies since they are absent in training. However, previous methods suffer from the similarity of student and teacher architecture, such that the distance is undesirably small for anomalies. For this reason, we propose asymmetric student-teacher networks (AST). We train a normalizing flow for density estimation as a teacher and a conventional feed-forward network as a student to trigger large distances for anomalies: The bijectivity of the normalizing flow enforces a divergence of teacher outputs for anomalies compared to normal data. Outside the training distribution the student cannot imitate this divergence due to its fundamentally different architecture. Our AST network compensates for wrongly estimated likelihoods by a normalizing flow, which was alternatively used for anomaly detection in previous work. We show that our method produces state-of-the-art results on the two currently most relevant defect detection datasets, MVTec AD and MVTec 3D-AD, regarding image-level anomaly detection on RGB and 3D data.
@inproceedings{diva2:1792621,
author = {Rudolph, Marco and Wehrbein, Tom and Rosenhahn, Bodo and Wandt, Bastian},
title = {{Asymmetric Student-Teacher Networks for Industrial Anomaly Detection}},
booktitle = {2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)},
year = {2023},
series = {IEEE Winter Conference on Applications of Computer Vision},
pages = {2591--2601},
publisher = {IEEE COMPUTER SOC},
}
Super-resolving medical images can help physicians in providing more accurate diagnostics. In many situations, computed tomography (CT) or magnetic resonance imaging (MRI) techniques capture several scans (modes) during a single investigation, which can jointly be used (in a multimodal fashion) to further boost the quality of super-resolution results. To this end, we propose a novel multimodal multi-head convolutional attention module to super-resolve CT and MRI scans. Our attention module uses the convolution operation to perform joint spatial-channel attention on multiple concatenated input tensors, where the kernel (receptive field) size controls the reduction rate of the spatial attention, and the number of convolutional filters controls the reduction rate of the channel attention, respectively. We introduce multiple attention heads, each head having a distinct receptive field size corresponding to a particular reduction rate for the spatial attention. We integrate our multimodal multi-head convolutional attention (MMHCA) into two deep neural architectures for super-resolution and conduct experiments on three data sets. Our empirical results show the superiority of our attention module over the state-of-the-art attention mechanisms used in super-resolution. Moreover, we conduct an ablation study to assess the impact of the components involved in our attention module, e.g. the number of inputs or the number of heads. Our code is freely available at https://github.com/lilygeorgescu/MHCA.
@inproceedings{diva2:1792453,
author = {Georgescu, Mariana-Iuliana and Ionescu, Radu Tudor and Miron, Andreea-Iuliana and Savencu, Olivian and Ristea, Nicolae-Catalin and Verga, Nicolae and Khan, Fahad},
title = {{Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution}},
booktitle = {2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV)},
year = {2023},
series = {IEEE Winter Conference on Applications of Computer Vision},
pages = {2194--2204},
publisher = {IEEE COMPUTER SOC},
}
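A hedged sketch of a multi-head convolutional attention module in the spirit of the MMHCA abstract above: the modalities are concatenated channel-wise and each head applies a strided convolution (whose kernel size sets the spatial reduction), expands back, and gates the input with a sigmoid map. The layer shapes, GELU activation and bilinear resize are assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttentionHead(nn.Module):
    """One head of convolutional joint spatial-channel attention (illustrative):
    a strided conv contracts space and channels, a transposed conv expands back,
    and a sigmoid map gates the input."""
    def __init__(self, channels, kernel_size, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        pad = kernel_size // 2
        self.squeeze = nn.Conv2d(channels, hidden, kernel_size, stride=2, padding=pad)
        self.expand = nn.ConvTranspose2d(hidden, channels, kernel_size, stride=2,
                                         padding=pad, output_padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        attn = self.expand(self.act(self.squeeze(x)))
        attn = torch.sigmoid(F.interpolate(attn, size=x.shape[-2:], mode='bilinear',
                                           align_corners=False))
        return x * attn

class MultiHeadConvAttention(nn.Module):
    """Concatenates the modality tensors channel-wise and runs heads with
    different kernel sizes, summing the attended outputs.
    `channels` is the total channel count after concatenation."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.heads = nn.ModuleList(ConvAttentionHead(channels, k) for k in kernel_sizes)

    def forward(self, inputs):                      # inputs: list of (B, C_i, H, W)
        x = torch.cat(inputs, dim=1)
        return sum(head(x) for head in self.heads)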
Class-Incremental Learning is a challenging problem in machine learning that aims to extend previously trained neural networks with new classes. This is especially useful if the system is able to classify new objects despite the original training data being unavailable. Although the semantic segmentation problem has received less attention than classification, it poses distinct problems and challenges, since previous and future target classes can be unlabeled in the images of a single increment. In this case, the background, past and future classes are correlated and a background shift arises.
In this paper, we address the problem of how to model unlabeled classes while avoiding spurious feature clustering of future uncorrelated classes. We propose to use Evidential Deep Learning to model the evidence of the classes as a Dirichlet distribution. Our method factorizes the problem into a separate foreground class probability, calculated by the expected value of the Dirichlet distribution, and an unknown class (background) probability corresponding to the uncertainty of the estimate. In our novel formulation, the background probability is implicitly modeled, avoiding the feature space clustering that comes from forcing the model to output a high background score for pixels that are not labeled as objects. Experiments on the incremental Pascal VOC and ADE20k benchmarks show that our method is superior to the state of the art, especially when repeatedly learning new classes with an increasing number of increments.
@inproceedings{diva2:1753366,
author = {Holmquist, Karl and Klas\'{e}n, Lena and Felsberg, Michael},
title = {{Evidential Deep Learning for Class-Incremental Semantic Segmentation}},
booktitle = {Image Analysis. SCIA 2023.},
year = {2023},
series = {Lecture Notes in Computer Science},
volume = {13886},
pages = {32--48},
publisher = {Springer},
}
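The factorization described above has a compact closed form; a sketch, assuming softplus is used as the evidence function: with evidence e = softplus(logits) and Dirichlet parameters alpha = e + 1, the foreground probabilities are the Dirichlet mean alpha_k / S and the unknown (background) probability is the vacuity K / S, where S is the sum of the alpha_k.

import torch
import torch.nn.functional as F

def evidential_probs(logits):
    """logits: (B, K, H, W) per-pixel evidence logits for K foreground classes.
    Evidence e = softplus(logits) parameterizes a Dirichlet with alpha = e + 1.
    Foreground probabilities are the Dirichlet mean and the background/unknown
    probability is the vacuity-style uncertainty K / S."""
    evidence = F.softplus(logits)
    alpha = evidence + 1.0
    strength = alpha.sum(dim=1, keepdim=True)            # S = sum_k alpha_k
    foreground = alpha / strength                        # E[p_k] = alpha_k / S
    unknown = logits.shape[1] / strength                 # u = K / S
    return foreground, unknown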
Following the successful application of vision transformers in multiple computer vision tasks, these models have drawn the attention of the signal processing community. This is because signals are often represented as spectrograms (e.g. through Discrete Fourier Transform) which can be directly provided as input to vision transformers. However, naively applying transformers to spectrograms is suboptimal. Since the axes represent distinct dimensions, i.e. frequency and time, we argue that a better approach is to separate the attention dedicated to each axis. To this end, we propose the Separable Transformer (SepTr), an architecture that employs two transformer blocks in a sequential manner, the first attending to tokens within the same time interval, and the second attending to tokens within the same frequency bin. We conduct experiments on three benchmark data sets, showing that our separable architecture outperforms conventional vision transformers and other state-of-the-art methods. Unlike standard transformers, SepTr linearly scales the number of trainable parameters with the input size, thus having a lower memory footprint. Our code is available as open source at https://github.com/ristea/septr.
@inproceedings{diva2:1744003,
author = {Ristea, Nicolae-Catalin and Ionescu, Radu Tudor and Khan, Fahad},
title = {{SepTr: Separable Transformer for Audio Spectrogram Processing}},
booktitle = {INTERSPEECH 2022},
year = {2022},
series = {Interspeech},
pages = {4103--4107},
publisher = {ISCA-INT SPEECH COMMUNICATION ASSOC},
}
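A minimal sketch of the separable attention pattern described above, applied to spectrogram tokens of shape (batch, frequency, time, dim): the first block attends across tokens sharing a time step, the second across tokens sharing a frequency bin. The use of stock nn.TransformerEncoderLayer blocks is an assumption for brevity.

import torch
import torch.nn as nn

class SeparableSpectrogramEncoder(nn.Module):
    """Illustrative separable attention over a spectrogram (B, F, T, D):
    the first block attends across frequency within each time step,
    the second across time within each frequency bin."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.freq_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.time_block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        b, f, t, d = x.shape
        x = self.freq_block(x.permute(0, 2, 1, 3).reshape(b * t, f, d))   # tokens share a time step
        x = x.reshape(b, t, f, d).permute(0, 2, 1, 3).reshape(b * f, t, d)
        x = self.time_block(x)                                            # tokens share a frequency bin
        return x.reshape(b, f, t, d)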
Creative sketching or doodling is an expressive activity, where imaginative and previously unseen depictions of everyday visual objects are drawn. Creative sketch image generation is a challenging vision problem, where the task is to generate diverse, yet realistic creative sketches possessing the unseen composition of the visual-world objects. Here, we propose a novel coarse-to-fine two-stage framework, DoodleFormer, that decomposes the creative sketch generation problem into the creation of coarse sketch composition followed by the incorporation of fine-details in the sketch. We introduce graph-aware transformer encoders that effectively capture global dynamic as well as local static structural relations among different body parts. To ensure diversity of the generated creative sketches, we introduce a probabilistic coarse sketch decoder that explicitly models the variations of each sketch body part to be drawn. Experiments are performed on two creative sketch datasets: Creative Birds and Creative Creatures. Our qualitative, quantitative and human-based evaluations show that DoodleFormer outperforms the state-of-the-art on both datasets, yielding realistic and diverse creative sketches. On Creative Creatures, DoodleFormer achieves an absolute gain of 25 in Frechet inception distance (FID) over state-of-the-art. We also demonstrate the effectiveness of DoodleFormer for related applications of text to creative sketch generation, sketch completion and house layout generation. Code is available at: https://github.com/ ankanbhunia/doodleformer.
@inproceedings{diva2:1740898,
author = {Bhunia, Ankan Kumar and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad and Laaksonen, Jorma and Felsberg, Michael},
title = {{DoodleFormer: Creative Sketch Drawing with Transformers}},
booktitle = {COMPUTER VISION - ECCV 2022, PT XVII},
year = {2022},
series = {Lecture Notes in Computer Science},
pages = {338--355},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
Automatically identifying harmful content in video is an important task with a wide range of applications. However, there is a lack of professionally labeled open datasets available. In this work VidHarm, an open dataset of 3589 video clips from film trailers annotated by professionals, is presented. An analysis of the dataset is performed, revealing, among other things, the relation between clip and trailer level annotations. Audiovisual models are trained on the dataset and an in-depth study of modeling choices is conducted. The results show that performance is greatly improved by combining the visual and audio modality, pre-training on large-scale video recognition datasets, and class balanced sampling. Lastly, biases of the trained models are investigated using discrimination probing. VidHarm is openly available, and further details are available at the webpage https://vidharm.github.io/
@inproceedings{diva2:1738691,
author = {Edstedt, Johan and Berg, Amanda and Felsberg, Michael and Karlsson, Johan and Benavente, Francisca and Novak, Anette and Pihlgren, Gustav Grund},
title = {{VidHarm: A Clip Based Dataset for Harmful Content Detection}},
booktitle = {2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2022},
series = {International Conference on Pattern Recognition},
pages = {1543--1549},
publisher = {IEEE},
}
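Class-balanced sampling, reported above as one of the important training choices, can be sketched with a weighted sampler; the labels argument and the inverse-frequency weighting are assumptions about how the clips and their class indices are exposed.

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

def class_balanced_loader(dataset, labels, batch_size=32):
    """Builds a loader that samples each class at roughly equal frequency.
    `labels` is the per-clip class index list for `dataset` (an assumption here)."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels).float()
    weights = (1.0 / class_counts)[labels]            # inverse-frequency weight per sample
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)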
State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations. We argue that such an attention computation ignores the multiscale spatio-temporal feature relationships that are crucial to tackle target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances in different frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multiscale spatio-temporal feature space. We conduct extensive experiments on two benchmarks: Youtube-VIS (2019 and 2021). Our MS-STS VIS achieves state-of-the-art performance on both benchmarks. When using the ResNet50 backbone, our MS-STS achieves a mask AP of 50.1%, outperforming the best reported results in literature by 2.7% and by 4.8% at higher overlap threshold of AP75, while being comparable in model size and speed on Youtube-VIS 2019 val. set. When using the Swin Transformer backbone, MS-STS VIS achieves mask AP of 61.0% on Youtube-VIS 2019 val. set.
@inproceedings{diva2:1733507,
author = {Thawakar, Omkar and Narayan, Sanath and Cao, Jiale and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Muhammad Haris and Khan, Salman and Felsberg, Michael and Khan, Fahad},
title = {{Video Instance Segmentation via Multi-Scale Spatio-Temporal Split Attention Transformer}},
booktitle = {COMPUTER VISION, ECCV 2022, PT XXIX},
year = {2022},
series = {Lecture Notes in Computer Science},
pages = {666--681},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
Few-shot segmentation is a challenging dense prediction task, which entails segmenting a novel query image given only a small annotated support set. The key problem is thus to design a method that aggregates detailed information from the support set, while being robust to large variations in appearance and context. To this end, we propose a few-shot segmentation method based on dense Gaussian process (GP) regression. Given the support set, our dense GP learns the mapping from local deep image features to mask values, capable of capturing complex appearance distributions. Furthermore, it provides a principled means of capturing uncertainty, which serves as another powerful cue for the final segmentation, obtained by a CNN decoder. Instead of a one-dimensional mask output, we further exploit the end-to-end learning capabilities of our approach to learn a high-dimensional output space for the GP. Our approach sets a new state-of-the-art on the PASCAL-5(i) and COCO-20(i) benchmarks, achieving an absolute gain of +8.4 mIoU in the COCO-20(i) 5-shot setting. Furthermore, the segmentation quality of our approach scales gracefully when increasing the support set size, while achieving robust cross-dataset transfer.
@inproceedings{diva2:1733506,
author = {Johnander, Joakim and Edstedt, Johan and Felsberg, Michael and Khan, Fahad and Danelljan, Martin},
title = {{Dense Gaussian Processes for Few-Shot Segmentation}},
booktitle = {COMPUTER VISION, ECCV 2022, PT XXIX},
year = {2022},
series = {Lecture Notes in Computer Science},
pages = {217--234},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
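A closed-form sketch of the dense GP regression step described above, mapping support features to mask encodings under an RBF kernel and returning a posterior mean and variance; the learned kernel, the high-dimensional output encoding and all hyperparameters of the actual method are abstracted away here.

import torch

def gp_posterior(support_feats, support_vals, query_feats, lengthscale=1.0, noise=0.1):
    """Dense GP regression from support features to mask encodings (sketch).
    support_feats: (N, D), support_vals: (N, E), query_feats: (M, D).
    Returns the posterior mean (M, E) and variance (M,) under an RBF kernel."""
    def rbf(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-0.5 * d2 / lengthscale ** 2)

    k_ss = rbf(support_feats, support_feats) + noise * torch.eye(len(support_feats))
    k_qs = rbf(query_feats, support_feats)
    mean = k_qs @ torch.linalg.solve(k_ss, support_vals)
    var = 1.0 - (k_qs * torch.linalg.solve(k_ss, k_qs.t()).t()).sum(dim=1)
    return mean, var.clamp_min(0.0)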
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability.
@inproceedings{diva2:1731528,
author = {Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
title = {{Class-Agnostic Object Detection with Multi-modal Transformer}},
booktitle = {COMPUTER VISION, ECCV 2022, PT X},
year = {2022},
series = {Lecture Notes in Computer Science},
pages = {512--531},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
We propose a novel few-shot action recognition framework, STRM, which enhances class-specific feature discriminability while simultaneously learning higher-order temporal representations. The focus of our approach is a novel spatio-temporal enrichment module that aggregates spatial and temporal contexts with dedicated local patch-level and global frame-level feature enrichment sub-modules. Local patch-level enrichment captures the appearance-based characteristics of actions. On the other hand, global frame-level enrichment explicitly encodes the broad temporal context, thereby capturing the relevant object features over time. The resulting spatio-temporally enriched representations are then utilized to learn the relational matching between query and support action sub-sequences. We further introduce a query-class similarity classifier on the patch-level enriched features to enhance class-specific feature discriminability by reinforcing the feature learning at different stages in the proposed framework. Experiments are performed on four few-shot action recognition benchmarks: Kinetics, SSv2, HMDB51 and UCF101. Our extensive ablation study reveals the benefits of the proposed contributions. Furthermore, our approach sets a new state-of-the-art on all four benchmarks. On the challenging SSv2 benchmark, our approach achieves an absolute gain of 3.5% in classification accuracy, as compared to the best existing method in the literature.
@inproceedings{diva2:1725183,
author = {Thatipelli, Anirudh and Narayan, Sanath and Khan, Salman and Anwer, Rao Muhammad and Khan, Fahad and Ghanem, Bernard},
title = {{Spatio-temporal Relation Modeling for Few-shot Action Recognition}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {19926--19935},
publisher = {IEEE COMPUTER SOC},
}
Road networks are the core infrastructure for connected and autonomous vehicles, but creating meaningful representations for machine learning applications is a challenging task. In this work, we propose to integrate remote sensing vision data into road network data for improved embeddings with graph neural networks. We present a segmentation of road edges based on spatio-temporal road and traffic characteristics, which allows enriching the attribute set of road networks with visual features of satellite imagery and digital surface models. We show that both the segmentation and the integration of vision data can increase performance on a road type classification task, and we achieve state-of-the-art performance on the OSM+DiDi Chuxing dataset on Chengdu, China.
@inproceedings{diva2:1725177,
author = {Stromann, Oliver and Razavi, Alireza and Felsberg, Michael},
title = {{Learning to Integrate Vision Data into Road Network Data}},
booktitle = {2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)},
year = {2022},
series = {International Conference on Acoustics Speech and Signal Processing ICASSP},
pages = {4548--4552},
publisher = {IEEE},
}
Classification networks can be used to localize and segment objects in images by means of class activation maps (CAMs). However, without pixel-level annotations, classification networks are known to (1) mainly focus on discriminative regions, and (2) to produce diffuse CAMs without well-defined prediction contours. In this work, we approach both problems with two contributions for improving CAM learning. First, we incorporate importance sampling based on the class-wise probability mass function induced by the CAMs to produce stochastic image-level class predictions. This results in CAMs which activate over a larger extent of objects. Second, we formulate a feature similarity loss term which aims to match the prediction contours with edges in the image. As a third contribution, we conduct experiments on the PASCAL VOC 2012 benchmark dataset to demonstrate that these modifications significantly increase the performance in terms of contour accuracy, while being comparable to current state-of-the-art methods in terms of region similarity.
@inproceedings{diva2:1725167,
author = {Jonnarth, Arvi and Felsberg, Michael},
title = {{Importance Sampling CAMs for Weakly-Supervised Segmentation}},
booktitle = {2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)},
year = {2022},
series = {International Conference on Acoustics Speech and Signal Processing ICASSP},
pages = {2639--2643},
publisher = {IEEE},
}
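A sketch of the importance-sampling idea above: each class activation map is normalized into a spatial probability mass function, pixel locations are drawn from it, and the activations at the sampled locations form a stochastic image-level score. The number of samples and the mean aggregation are illustrative assumptions.

import torch

def sampled_class_scores(cams, num_samples=16):
    """cams: (B, K, H, W) class activation maps (assumed mostly non-negative).
    For each class, spatial locations are sampled from the pmf induced by the
    CAM, and the activations at the sampled locations are averaged into a
    stochastic image-level score."""
    b, k, h, w = cams.shape
    flat = cams.flatten(2)                                     # (B, K, H*W)
    pmf = flat.clamp_min(0) + 1e-8                             # non-negative sampling weights
    pmf = pmf / pmf.sum(dim=-1, keepdim=True)
    idx = torch.multinomial(pmf.reshape(b * k, -1), num_samples, replacement=True)
    sampled = flat.reshape(b * k, -1).gather(1, idx)           # activations at sampled pixels
    return sampled.mean(dim=1).reshape(b, k)                   # stochastic image-level scores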
Detecting abnormal events in video is commonly framed as a one-class classification task, where training videos contain only normal events, while test videos encompass both normal and abnormal events. In this scenario, anomaly detection is an open-set problem. However, some studies assimilate anomaly detection to action recognition. This is a closed-set scenario that fails to test the capability of systems at detecting new anomaly types. To this end, we propose UBnormal, a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing data sets, we introduce abnormal events annotated at the pixel level at training time, for the first time enabling the use of fully-supervised learning methods for abnormal event detection. To preserve the typical open-set formulation, we make sure to include dis-joint sets of anomaly types in our training and test collections of videos. To our knowledge, UBnormal is the first video anomaly detection benchmark to allow a fair head-to-head comparison between one-class open-set models and supervised closed-set models, as shown in our experiments. Moreover, we provide empirical evidence showing that UB-normal can enhance the performance of a state-of-the-art anomaly detection framework on two prominent data sets, Avenue and ShanghaiTech. Our benchmark is freely available at https://github.com/lilygeorgescu/UBnormal.
@inproceedings{diva2:1725086,
author = {Acsintoae, Andra and Florescu, Andrei and Georgescu, Mariana-Iuliana and Mare, Tudor and Sumedrea, Paul and Ionescu, Radu Tudor and Khan, Fahad and Shah, Mubarak},
title = {{UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {20111--20121},
publisher = {IEEE COMPUTER SOC},
}
Since convolutional neural networks (CNNs) perform well at learning generalizable image priors from largescale data, these models have been extensively applied to image restoration and related tasks. Recently, another class of neural architectures, Transformers, have shown significant performance gains on natural language and high-level vision tasks. While the Transformer model mitigates the shortcomings of CNNs (i.e., limited receptive field and inadaptability to input content), its computational complexity grows quadratically with the spatial resolution, therefore making it infeasible to apply to most image restoration tasks involving high-resolution images. In this work, we propose an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions, while still remaining applicable to large images. Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks, including image deraining, single-image motion deblurring, defocus deblurring (single-image and dual-pixel data), and image denoising (Gaussian grayscale/color denoising, and real image denoising). The source code and pre-trained models are available at https://github.com/swz30/Restormer.
@inproceedings{diva2:1720852,
author = {Zamir, Syed Waqas and Arora, Aditya and Khan, Salman and Hayat, Munawar and Khan, Fahad and Yang, Ming-Hsuan},
title = {{Restormer: Efficient Transformer for High-Resolution Image Restoration}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {5718--5729},
publisher = {IEEE COMPUTER SOC},
}
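The key computational trick mentioned above, attention applied across channels rather than across pixels so that the cost grows linearly with spatial resolution, can be sketched as a transposed attention layer; this is a simplified single-layer sketch, not the full Restormer block.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransposedAttention(nn.Module):
    """Channel-wise ('transposed') self-attention: the attention map is C x C
    instead of (HW) x (HW). Requires channels to be divisible by heads."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        reshape = lambda t: t.reshape(b, self.heads, c // self.heads, h * w)
        q, k, v = map(reshape, (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature    # (B, heads, C/h, C/h)
        out = attn.softmax(dim=-1) @ v                         # mixes channels, not pixels
        return self.proj(out.reshape(b, c, h, w))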
Anomaly detection is commonly pursued as a one-class classification problem, where models can only learn from normal training samples, while being evaluated on both normal and abnormal test samples. Among the successful approaches for anomaly detection, a distinguished category of methods relies on predicting masked information (e.g. patches, future frames, etc.) and leveraging the reconstruction error with respect to the masked information as an abnormality score. Different from related methods, we propose to integrate the reconstruction-based functionality into a novel self-supervised predictive architectural building block. The proposed self-supervised block is generic and can easily be incorporated into various state-of-the-art anomaly detection methods. Our block starts with a convolutional layer with dilated filters, where the center area of the receptive field is masked. The resulting activation maps are passed through a channel attention module. Our block is equipped with a loss that minimizes the reconstruction error with respect to the masked area in the receptive field. We demonstrate the generality of our block by integrating it into several state-of-the-art frameworks for anomaly detection on image and video, providing empirical evidence that shows considerable performance improvements on MVTec AD, Avenue, and ShanghaiTech. We release our code as open source at: https://github.com/ristea/sspcab.
@inproceedings{diva2:1720849,
author = {Ristea, Nicolae-Catalin and Madan, Neelu and Ionescu, Radu Tudor and Nasrollahi, Kamal and Khan, Fahad and Moeslund, Thomas B. and Shah, Mubarak},
title = {{Self-Supervised Predictive Convolutional Attentive Block for Anomaly Detection}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {13566--13576},
publisher = {IEEE COMPUTER SOC},
}
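A simplified stand-in for the self-supervised block described above: a dilated convolution whose centre weight is zeroed predicts each activation from its surrounding context, a squeeze-and-excitation module supplies channel attention, and a reconstruction loss penalizes the error with respect to the input. The real block masks the centre area of the receptive field with a specific sub-kernel layout; the masking scheme, attention design and loss placement here are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCenterBlock(nn.Module):
    """Simplified self-supervised predictive block (a sketch, not the authors'
    exact design): a dilated conv with its centre weight zeroed predicts each
    location from context, followed by squeeze-and-excitation channel attention;
    the reconstruction loss measures the error w.r.t. the input."""
    def __init__(self, channels, kernel_size=3, dilation=2, reduction=8):
        super().__init__()
        pad = dilation * (kernel_size // 2)
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        mask = torch.ones(1, 1, kernel_size, kernel_size)
        mask[..., kernel_size // 2, kernel_size // 2] = 0.0       # hide the centre of the field
        self.register_buffer('mask', mask)
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
                                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        pred = F.conv2d(x, self.conv.weight * self.mask, self.conv.bias,
                        padding=self.conv.padding, dilation=self.conv.dilation)
        out = pred * self.se(pred)
        recon_loss = F.mse_loss(out, x)          # self-supervised reconstruction objective
        return out, recon_loss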
In this paper, we propose self-supervised training for video transformers using unlabeled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, to be invariant to spatiotemporal variations in actions. To the best of our knowledge, the proposed approach is the first to alleviate the dependency on negative samples or dedicated memory banks in Self-supervised Video Transformer (SVT). Further, owing to the flexibility of Transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encoding and supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes. Code is available at: https://git.io/J1juJ
@inproceedings{diva2:1720846,
author = {Ranasinghe, Kanchana and Naseer, Muzammal and Khan, Salman and Khan, Fahad and Ryoo, Michael S.},
title = {{Self-supervised Video Transformer}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {2864--2874},
publisher = {IEEE COMPUTER SOC},
}
Deep learning models tend to forget their earlier knowledge while incrementally learning new tasks. This behavior emerges because the parameter updates optimized for the new tasks may not align well with the updates suitable for older tasks. The resulting latent representation mismatch causes forgetting. In this work, we propose ELI: Energy-based Latent Aligner for Incremental Learning, which first learns an energy manifold for the latent representations such that previous task latents will have low energy and the current task latents have high energy values. This learned manifold is used to counter the representational shift that happens during incremental learning. The implicit regularization that is offered by our proposed methodology can be used as a plug-and-play module in existing incremental learning methodologies. We validate this through extensive evaluation on CIFAR-100, ImageNet subset, ImageNet 1k and Pascal VOC datasets. We observe consistent improvement when ELI is added to three prominent methodologies in class-incremental learning, across multiple incremental settings. Further, when added to the state-of-the-art incremental object detector, ELI provides over 5% improvement in detection accuracy, corroborating its effectiveness and complementary advantage to the existing art. Code is available at: https://github.com/JosephKJ/ELI.
@inproceedings{diva2:1720844,
author = {Joseph, K. J. and Khan, Salman and Khan, Fahad and Anwer, Rao Muhammad and Balasubramanian, Vineeth N.},
title = {{Energy-based Latent Aligner for Incremental Learning}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {7442--7451},
publisher = {IEEE COMPUTER SOC},
}
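The latent alignment described above can be sketched as a few gradient steps that push current-task latents towards the low-energy region of a learned energy model; the step count, step size and energy network interface are illustrative assumptions.

import torch

def align_latents(latents, energy_model, steps=5, step_size=0.1):
    """Sketch of an energy-based latent alignment step: latents from the current
    model are moved towards the low-energy region learned for previous tasks by
    a few gradient steps on the energy. The energy network is assumed given."""
    z = latents.detach().clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_model(z).sum()
        grad, = torch.autograd.grad(energy, z)
        z = (z - step_size * grad).detach().requires_grad_(True)
    return z.detach()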
Open-world object detection (OWOD) is a challenging computer vision problem, where the task is to detect a known set of object categories while simultaneously identifying unknown objects. Additionally, the model must incrementally learn new classes that become known in the next training episodes. Distinct from standard object detection, the OWOD setting poses significant challenges for generating quality candidate proposals on potentially unknown objects, separating the unknown objects from the background and detecting diverse unknown objects. Here, we introduce a novel end-to-end transformer-based framework, OW-DETR, for open-world object detection. The proposed OW-DETR comprises three dedicated components, namely attention-driven pseudo-labeling, novelty classification and objectness scoring, to explicitly address the aforementioned OWOD challenges. Our OW-DETR explicitly encodes multi-scale contextual information, possesses less inductive bias, enables knowledge transfer from known classes to the unknown class and can better discriminate between unknown objects and background. Comprehensive experiments are performed on two benchmarks: MS-COCO and PASCAL VOC. The extensive ablations reveal the merits of our proposed contributions. Further, our model outperforms the recently introduced OWOD approach, ORE, with absolute gains ranging from 1.8% to 3.3% in terms of unknown recall on MS-COCO. In the case of incremental object detection, OW-DETR outperforms the state-of-the-art for all settings on PASCAL VOC. Our code is available at https://github.com/akshitac8/OW-DETR.
@inproceedings{diva2:1720843,
author = {Gupta, Akshita and Narayan, Sanath and Joseph, K. J. and Khan, Salman and Khan, Fahad and Shah, Mubarak},
title = {{OW-DETR: Open-world Detection Transformer}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {9225--9234},
publisher = {IEEE COMPUTER SOC},
}
Modern handheld devices can acquire a burst image sequence in quick succession. However, the individual acquired frames suffer from multiple degradations and are misaligned due to camera shake and object motions. The goal of Burst Image Restoration is to effectively combine complementary cues across multiple burst frames to generate high-quality outputs. Towards this goal, we develop a novel approach by solely focusing on the effective information exchange between burst frames, such that the degradations get filtered out while the actual scene details are preserved and enhanced. Our central idea is to create a set of pseudo-burst features that combine complementary information from all the input burst frames to seamlessly exchange information. However, the pseudo-burst cannot be successfully created unless the individual burst frames are properly aligned to discount inter-frame movements. Therefore, our approach initially extracts pre-processed features from each burst frame and matches them using an edge-boosting burst alignment module. The pseudo-burst features are then created and enriched using multi-scale contextual information. Our final step is to adaptively aggregate information from the pseudo-burst features to progressively increase resolution in multiple stages while merging the pseudo-burst features. In comparison to existing works that usually follow a late fusion scheme with single-stage upsampling, our approach performs favorably, delivering state-of-the-art performance on burst super-resolution, burst low-light image enhancement and burst denoising tasks. The source code and pre-trained models are available at https://github.com/akshaydudhane16/BIPNet.
@inproceedings{diva2:1720837,
author = {Dudhane, Akshay and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad and Yang, Ming-Hsuan},
title = {{Burst Image Restoration and Enhancement}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {5749--5758},
publisher = {IEEE COMPUTER SOC},
}
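A sketch of the pseudo-burst idea from the abstract above: after alignment, the c-th channel of every burst frame is grouped into one pseudo-burst feature so that each group mixes information from all frames, and a small convolution fuses each group. The grouping-by-channel is the essential point; the fusion layer is an illustrative assumption.

import torch
import torch.nn as nn

class PseudoBurstFusion(nn.Module):
    """Groups the c-th channel of every aligned burst frame into one pseudo-burst
    feature and fuses each group with a lightweight convolution (layer sizes are
    illustrative assumptions)."""
    def __init__(self, num_frames, channels):
        super().__init__()
        self.fuse = nn.Conv2d(num_frames, 1, kernel_size=3, padding=1)
        self.channels = channels

    def forward(self, aligned):                    # aligned: (B, T, C, H, W)
        b, t, c, h, w = aligned.shape
        pseudo = aligned.permute(0, 2, 1, 3, 4)    # (B, C, T, H, W): group by channel index
        pseudo = pseudo.reshape(b * c, t, h, w)
        fused = self.fuse(pseudo).reshape(b, c, h, w)
        return fused                               # burst-wide feature map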
We propose a novel one-step transformer-based person search framework, PSTR, that jointly performs person detection and re-identification (re-id) in a single architecture. PSTR comprises a person search-specialized (PSS) module that contains a detection encoder-decoder for person detection along with a discriminative re-id decoder for person re-id. The discriminative re-id decoder utilizes a multi-level supervision scheme with a shared decoder for discriminative re-id feature learning and also comprises a part attention block to encode relationship between different parts of a person. We further introduce a simple multi-scale scheme to support re-id across person instances at different scales. PSTR jointly achieves the diverse objectives of object-level recognition (detection) and instance-level matching (re-id). To the best of our knowledge, we are the first to propose an end-to-end one-step transformer-based person search framework. Experiments are performed on two popular benchmarks: CUHK-SYSU and PRW. Our extensive ablations reveal the merits of the proposed contributions. Further, the proposed PSTR sets a new state-of-the-art on both benchmarks. On the challenging PRW benchmark, PSTR achieves a mean average precision (mAP) score of 56.5%.
@inproceedings{diva2:1720835,
author = {Cao, Jiale and Pang, Yanwei and Anwer, Rao Muhammad and Cholakkal, Hisham and Xie, Jin and Shah, Mubarak and Khan, Fahad},
title = {{PSTR: End-to-End One-Step Person Search With Transformers}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
year = {2022},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {9448--9457},
publisher = {IEEE COMPUTER SOC},
}
Modern digital cameras generally count on image signal processing (ISP) pipelines for producing naturalistic RGB images. Nevertheless, in comparison to DSLR cameras, low-quality images are generally output from portable mobile devices due to their physical limitations. The synthesized low-quality images usually have multiple degradations - low-resolution owing to small camera sensors, mosaic patterns on account of camera filter array and subpixel shifts due to camera motion. Such degradations usually restrain the performance of single image super-resolution methodologies for retrieving a high-resolution (HR) image from a single low-resolution (LR) image. Burst image super-resolution aims at restoring a photo-realistic HR image by capturing the abundant information from multiple LR images. Lately, the soaring popularity of burst photography has made multi-frame processing an attractive solution for overcoming the limitations of single image processing. In our work, we thus propose a generic architecture, the adaptive feature consolidation network (AFCNet), for multi-frame processing. To alleviate the long-range dependency modelling problem that multi-frame approaches struggle to solve, we utilize an encoder-decoder based transformer backbone which learns multi-scale local-global representations. We propose a feature alignment module to align LR burst frame features. Further, the aligned features are fused and reconstructed by the abridged pseudo-burst fusion and adaptive group upsampling modules, respectively. Our proposed approach clearly outperforms the other existing state-of-the-art techniques on benchmark datasets. The experimental results illustrate the effectiveness and generality of our proposed framework in upgrading the visual quality of HR images.
@inproceedings{diva2:1718698,
author = {Mehta, Nancy and Dudhane, Akshay and Murala, Subrahmanyam and Zamir, Syed Waqas and Khan, Salman and Khan, Fahad},
title = {{Adaptive Feature Consolidation Network for Burst Super-Resolution}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2022)},
year = {2022},
pages = {1278--1285},
publisher = {IEEE},
}
Burst super-resolution has received increased attention in recent years due to its applications in mobile photography. By merging information from multiple shifted images of a scene, burst super-resolution aims to recover details which otherwise cannot be obtained using a simple input image. This paper reviews the NTIRE 2022 challenge on burst super-resolution. In the challenge, the participants were tasked with generating a clean RGB image with 4x higher resolution, given a RAW noisy burst as input. That is, the methods need to perform joint denoising, demosaicking, and super-resolution. The challenge consisted of 2 tracks. Track 1 employed synthetic data, where pixel-accurate high-resolution ground truths are available. Track 2 on the other hand used real-world bursts captured from a handheld camera, along with approximately aligned reference images captured using a DSLR. 14 teams participated in the final testing phase. The top performing methods establish a new state-of-the-art on the burst super-resolution task.
@inproceedings{diva2:1718558,
author = {Bhat, Goutam and Danelljan, Martin and Timofte, Radu and Cao, Yizhen and Cao, Yuntian and Chen, Meiya and Chen, Xihao and Cheng, Shen and Dudhane, Akshay and Fan, Haoqiang and Gang, Ruipeng and Gao, Jian and Gu, Yan and Huang, Jie and Huang, Liufeng and Jo, Youngsu and Kang, Sukju and Khan, Salman and Khan, Fahad and Kondo, Yuki and Li, Chenghua and Li, Fangya and Li, Jinjing and Li, Youwei and Li, Zechao and Liu, Chenming and Liu, Shuaicheng and Liu, Zikun and Liu, Zhuoming and Luo, Ziwei and Luo, Zhengxiong and Mehta, Nancy and Murala, Subrahmanyam and Nam, Yoonchan and Nakatani, Chihiro and Ostyakov, Pavel and Pan, Jinshan and Song, Ge and Sun, Jian and Sun, Long and Tang, Jinhui and Ukita, Norimichi and Wen, Zhihong and Wu, Qi and Wu, Xiaohe and Xiao, Zeyu and Xiong, Zhiwei and Xu, Rongjian and Xu, Ruikang and Yan, Youliang and Yang, Jialin and Yang, Wentao and Yang, Zhongbao and Yasue, Fuma and Yao, Mingde and Yu, Lei and Zhang, Cong and Zamir, Syed Waqas and Zhang, Jianxing and Zhang, Shuohao and Zhang, Zhilu and Zheng, Qian and Zhou, Gaofeng and Zhussip, Magauiya and Zou, Xueyi and Zuo, Wangmeng},
title = {{NTIRE 2022 Burst Super-Resolution Challenge}},
booktitle = {2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2022)},
year = {2022},
pages = {1040--1060},
publisher = {IEEE},
}
In this paper, we argue that modern pre-integration methods for inertial measurement units (IMUs) are accurate enough to ignore the drift for short time intervals. This allows us to consider a simplified camera model, which in turn admits further intrinsic calibration. We develop the first-ever solver to jointly solve the relative pose problem with unknown and equal focal length and radial distortion profile while utilizing the IMU data. Furthermore, we show significant speed-up compared to state-of-the-art algorithms, with small or negligible loss in accuracy for partially calibrated setups. The proposed algorithms are tested on both synthetic and real data, where the latter is focused on navigation using unmanned aerial vehicles (UAVs). We evaluate the proposed solvers on different commercially available low-cost UAVs, and demonstrate that the novel assumption on IMU drift is feasible in real-life applications. The extended intrinsic auto-calibration enables us to use distorted input images, making tedious calibration processes obsolete, compared to current state-of-the-art methods. Code available at: https://github.com/marcusvaltonen/DronePoseLib.
@inproceedings{diva2:1699783,
author = {Valtonen Örnhag, Marcus and Persson, Patrik and Wadenbäck, Mårten and Åström, Kalle and Heyden, Anders},
title = {{Trust Your IMU: Consequences of Ignoring the IMU Drift}},
booktitle = {Proceedings 2022 IEEE/CVF Conference on Computer Visionand Pattern Recognition Workshops},
year = {2022},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops},
pages = {4467--4476},
publisher = {IEEE Computer Society},
}
Recent progress towards designing models that can generalize to unseen domains (i.e., domain generalization) or unseen classes (i.e., zero-shot learning) has sparked interest in building models that can tackle both domain shift and semantic shift simultaneously (i.e., zero-shot domain generalization). For models to generalize to unseen classes in unseen domains, it is crucial to learn feature representations that preserve class-level (domain-invariant) as well as domain-specific information. Motivated by the success of generative zero-shot approaches, we propose a feature generative framework integrated with a COntext COnditional Adaptive (COCOA) Batch-Normalization layer to seamlessly integrate class-level semantic and domain-specific information. The generated visual features better capture the underlying data distribution, enabling us to generalize to unseen classes and domains at test-time. We thoroughly evaluate our approach on established large-scale benchmarks - DomainNet, DomainNet-LS (Limited Sources) - as well as a new CUB-Corruptions benchmark, and demonstrate promising performance over baselines and state-of-the-art methods. We show detailed ablations and analysis to verify that our proposed approach indeed allows us to generate better quality visual features relevant for zero-shot domain generalization.
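As a hedged illustration of the context-conditional normalization idea described above (not the authors' exact COCOA layer; module and variable names are hypothetical), a batch-normalization layer whose affine parameters are predicted from a context embedding can be sketched in PyTorch as follows:

import torch
import torch.nn as nn

class ContextConditionalBN(nn.Module):
    # Illustrative sketch: normalize without built-in affine parameters,
    # then apply a scale and shift predicted from a context vector.
    def __init__(self, num_features, context_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.to_gamma = nn.Linear(context_dim, num_features)
        self.to_beta = nn.Linear(context_dim, num_features)

    def forward(self, x, context):
        # x: (B, C, H, W), context: (B, context_dim)
        h = self.bn(x)
        gamma = self.to_gamma(context).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(context).unsqueeze(-1).unsqueeze(-1)
        return (1.0 + gamma) * h + beta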
@inproceedings{diva2:1691012,
author = {Mangla, Puneet and Chandhok, Shivam and Balasubramanian, Vineeth N. and Khan, Fahad},
title = {{COCOA: Context-Conditional Adaptation for Recognizing Unseen Classes in Unseen Domains}},
booktitle = {2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022)},
year = {2022},
series = {IEEE Winter Conference on Applications of Computer Vision},
pages = {1618--1627},
publisher = {IEEE COMPUTER SOC},
}
Emerging from low-level vision theory, steerable filters found their counterpart in prior work on steerable convolutional neural networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of neurons with spherical decision surfaces and operates on point clouds. Such spherical neurons are obtained by conformal embedding of Euclidean space and have recently been revisited in the context of learning representations of point sets. Focusing on 3D geometry, we exploit the isometry property of spherical neurons and derive a 3D steerability constraint. After training spherical neurons to classify point clouds in a canonical orientation, we use a tetrahedron basis to quadruplicate the neurons and construct rotation-equivariant spherical filter banks. We then apply the derived constraint to interpolate the filter bank outputs and, thus, obtain a rotation-invariant network. Finally, we use a synthetic point set and real-world 3D skeleton data to verify our theoretical findings. The code is available at https://github.com/pavlo-melnyk/steerable-3d-neurons.
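For intuition only, the sketch below shows a conformal-style embedding of the kind that gives rise to spherical decision surfaces: a 3D point and a sphere are lifted to 5D vectors whose dot product equals ½(r² − ‖x − c‖²), i.e., it is zero on the sphere, positive inside and negative outside. The exact convention and the steerable construction (tetrahedron basis, interpolation) follow the paper; function names here are hypothetical.

import numpy as np

def embed_point(x):
    # Lift a 3D point to R^5: (x, -1, -||x||^2 / 2).
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, [-1.0, -0.5 * np.dot(x, x)]])

def embed_sphere(center, radius):
    # Lift a sphere so that <P, S> = 0.5 * (r^2 - ||x - c||^2).
    c = np.asarray(center, dtype=float)
    return np.concatenate([c, [0.5 * (np.dot(c, c) - radius ** 2), 1.0]])

P = embed_point([1.0, 0.0, 0.0])
S = embed_sphere([0.0, 0.0, 0.0], 1.0)
print(P @ S)  # 0.0: the point lies exactly on the unit sphere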
@inproceedings{diva2:1686024,
author = {Melnyk, Pavlo and Felsberg, Michael and Wadenbäck, Mårten},
title = {{Steerable 3D Spherical Neurons}},
booktitle = {Proceedings of the 39th International Conference on Machine Learning},
year = {2022},
series = {Proceedings of Machine Learning Research},
pages = {15330--15339},
publisher = {PMLR},
}
@inproceedings{diva2:1679039,
author = {Naseer, Muzammal and Ranasinghe, Kanchana and Khan, Salman and Khan, Fahad Shahbaz and Porikli, Fatih},
title = {{On Improving Adversarial Transferability of Vision Transformers}},
booktitle = {The Tenth International Conference on Learning Representations (Virtual), April 25th through 29th},
year = {2022},
}
This article proposes an architecture which allows the prediction of intention by internally simulating perceptual states represented by action pattern vectors. To this end, associative self-organising neural networks (A-SOM) are utilised to build a hierarchical cognitive architecture for recognition and simulation of skeleton-based human actions. The abilities of the proposed architecture in recognising and predicting actions are evaluated in experiments using three different datasets of 3D actions. Based on the experiments of this article, applying internally simulated perceptual states represented by action pattern vectors improves the performance of the recognition task in all experiments. Furthermore, internal simulation of perception addresses the problem of having limited access to the sensory input, as well as the future prediction of consecutive perceptual sequences. The performance of the system is compared and discussed with a similar architecture using self-organizing neural networks (SOM).
@inproceedings{diva2:1638927,
author = {Gharaee, Zahra},
title = {{Predicting the intended action using internal simulation of perception}},
booktitle = {ICAART: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 2},
year = {2022},
pages = {626--635},
publisher = {SciTePress},
}
One of the main challenges of applying deep learning for robotics is the difficulty of efficiently adapting to new tasks while still maintaining the same performance on previous tasks. The problem of incrementally learning new tasks commonly struggles with catastrophic forgetting, in which the previous knowledge is lost. Class-incremental learning for semantic segmentation addresses this problem: we want to learn new semantic classes without having access to labeled data for previously learned classes. This is a problem in industry, where few pre-trained models and open datasets match the requirements exactly. In these cases it is both expensive and labour-intensive to collect an entirely new fully-labeled dataset. Instead, collecting a smaller dataset and only labeling the new classes is much more efficient in terms of data collection. In this paper we present the class-incremental learning problem for semantic segmentation, discuss related work in terms of the more thoroughly studied classification task, and experimentally validate the current state-of-the-art for semantic segmentation. This lays the foundation as we discuss some of the problems that still need to be investigated and improved upon in order to reach a new state-of-the-art for class-incremental semantic segmentation.
@inproceedings{diva2:1701982,
author = {Holmquist, Karl and Klas\'{e}n, Lena and Felsberg, Michael},
title = {{Class-Incremental Learning for Semantic Segmentation - A study}},
booktitle = {2021 Swedish Artificial Intelligence Society Workshop (SAIS)},
year = {2021},
pages = {25--28},
publisher = {IEEE},
}
While the untargeted black-box transferability of adversarial perturbations has been extensively studied before, changing an unseen model's decisions to a specific targeted class remains a challenging feat. In this paper, we propose a new generative approach for highly transferable targeted perturbations (TTP). We note that the existing methods are less suitable for this task due to their reliance on class-boundary information that changes from one model to another, thus reducing transferability. In contrast, our approach matches the perturbed image distribution with that of the target class, leading to high targeted transferability rates. To this end, we propose a new objective function that not only aligns the global distributions of source and target images, but also matches the local neighbourhood structure between the two domains. Based on the proposed objective, we train a generator function that can adaptively synthesize perturbations specific to a given input. Our generative approach is independent of the source or target domain labels, while consistently performing well against state-of-the-art methods on a wide range of attack settings. As an example, we achieve 32.63% target transferability from (an adversarially weak) VGG19(BN) to (a strong) WideResNet on the ImageNet val. set, which is 4x higher than the previous best generative attack and 16x better than the instance-specific iterative attack. Code is available at: https://github.com/Muzammal-Naseer/TTP.
@inproceedings{diva2:1691110,
author = {Naseer, Muzammal and Khan, Salman and Hayat, Munawar and Khan, Fahad and Porikli, Fatih},
title = {{On Generating Transferable Targeted Perturbations}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021)},
year = {2021},
pages = {7688--7697},
publisher = {IEEE},
}
We propose a novel transformer-based styled handwritten text image generation approach, HWT, that strives to learn both style-content entanglement and global and local style patterns. The proposed HWT captures the long- and short-range relationships within the style examples through a self-attention mechanism, thereby encoding both global and local style patterns. Further, the proposed transformer-based HWT comprises an encoder-decoder attention that enables style-content entanglement by gathering the style features of each query character. To the best of our knowledge, we are the first to introduce a transformer-based network for styled handwritten text generation. Our proposed HWT generates realistic styled handwritten text images and outperforms the state-of-the-art, as demonstrated through extensive qualitative, quantitative and human-based evaluations. The proposed HWT can handle text of arbitrary length and any desired writing style in a few-shot setting. Further, our HWT generalizes well to the challenging scenario where both words and writing style are unseen during training, generating realistic styled handwritten text images. Code is available at: https://github.com/ankanbhunia/HandwritingTransformers
@inproceedings{diva2:1691094,
author = {Bhunia, Ankan Kumar and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad and Shah, Mubarak},
title = {{Handwriting Transformers}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021)},
year = {2021},
pages = {1066--1074},
publisher = {IEEE},
}
Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages the features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed Orthogonal Projection Loss (OPL), which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints at the mini-batch level. Compared to other alternatives to CE, OPL offers unique advantages: it adds no learnable parameters, does not require careful negative mining, and is not sensitive to the batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tiered-ImageNet and Meta-dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise. Code is available at: https://github.com/kahnchana/opl.
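A rough sketch of an orthogonality-style loss on mini-batch features is given below; the exact OPL formulation and weighting follow the paper, and the gamma value here is a placeholder. Same-class feature pairs are pulled towards cosine similarity 1, while different-class pairs are pushed towards orthogonality (cosine similarity 0):

import torch
import torch.nn.functional as F

def orthogonality_loss(features, labels, gamma=0.5):
    # Illustrative sketch, not the exact published OPL.
    f = F.normalize(features, dim=1)              # (B, D) unit-norm features
    cos = f @ f.t()                               # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos = same & ~eye                             # same-class pairs, excluding self
    neg = ~same                                   # different-class pairs
    s = cos[pos].mean()                           # should approach 1
    d = cos[neg].abs().mean()                     # should approach 0
    return (1.0 - s) + gamma * d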
@inproceedings{diva2:1679002,
author = {Ranasinghe, Kanchana and Naseer, Muzammal and Hayat, Munawar and Khan, Salman and Khan, Fahad},
title = {{Orthogonal Projection Loss}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021)},
year = {2021},
pages = {12313--12323},
publisher = {IEEE},
}
This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on multiple benchmarks, including THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on all datasets, achieving gains as high as 2.3% in terms of mAP at IoU=0.5 on THUMOS14. Source code is available at https://github.com/naraysa/D2-Net.
@inproceedings{diva2:1679001,
author = {Narayan, Sanath and Cholakkal, Hisham and Hayat, Munawar and Khan, Fahad and Yang, Ming-Hsuan and Shao, Ling},
title = {{D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021)},
year = {2021},
pages = {13588--13597},
publisher = {IEEE},
}
Multi-label zero-shot learning (ZSL) is a more realistic counterpart of standard single-label ZSL since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features, with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes is large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternate approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9% mAP for ZSL, compared to the best published results.
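The design choice of pooling class predictions over regions, rather than pooling the features themselves, can be illustrated with a minimal hypothetical sketch (the bi-level attention enrichment is omitted, and max-pooling stands in for whatever pooling the paper uses):

import torch

def image_level_scores(region_feats, class_embeddings):
    # region_feats: (B, R, D) region-level visual features
    # class_embeddings: (C, D) class semantic vectors
    # Compute per-region class scores, then pool the predictions over regions,
    # keeping per-class evidence disentangled (illustrative sketch).
    region_scores = torch.einsum('brd,cd->brc', region_feats, class_embeddings)
    image_scores = region_scores.max(dim=1).values   # (B, C)
    return image_scores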
@inproceedings{diva2:1678999,
author = {Narayan, Sanath and Gupta, Akshita and Khan, Salman and Khan, Fahad and Shao, Ling and Shah, Mubarak},
title = {{Discriminative Region-based Multi-Label Zero-Shot Learning}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021)},
year = {2021},
pages = {8711--8720},
publisher = {IEEE},
}
Video instance segmentation is one of the core problems in computer vision. Formulating a purely learning-based method, which models the generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning framework where the entire video instance segmentation problem is modeled jointly. To this end, we design a graph neural network that in each frame jointly processes all detections and a memory of previously seen tracks. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach, operating at over 25 FPS, outperforms previous video real-time methods. We further conduct detailed ablative experiments that validate the different aspects of our approach.
@inproceedings{diva2:1647780,
author = {Johnander, Joakim and Brissman, Emil and Danelljan, Martin and Felsberg, Michael},
title = {{Video Instance Segmentation with Recurrent Graph Neural Networks}},
booktitle = {Pattern Recognition},
year = {2021},
series = {Lecture Notes in Computer Science},
volume = {13024},
pages = {206--221},
publisher = {Springer},
}
The Visual Object Tracking challenge VOT2021 is the ninth annual tracker benchmarking activity organized by the VOT initiative. Results of 71 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2021 challenge was composed of four sub-challenges focusing on different tracking domains: (i) VOT-ST2021 challenge focused on short-term tracking in RGB, (ii) VOT-RT2021 challenge focused on "real-time" short-term tracking in RGB, (iii) VOT-LT2021 focused on long-term tracking, namely coping with target disappearance and reappearance, and (iv) VOT-RGBD2021 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2021 dataset was refreshed, while VOT-RGBD2021 introduces a training dataset and a sequestered dataset for winner identification. The datasets, the evaluation kit, the results, and the source code for most of the trackers are publicly available at the challenge website.
@inproceedings{diva2:1643014,
author = {Kristan, Matej and Matas, Jiri and Leonardis, Ales and Felsberg, Michael and Pflugfelder, Roman and Kamarainen, Joni-Kristian and Chang, Hyung Jin and Danelljan, Martin and Zajc, Luka Cehovin and Lukezic, Alan and Drbohlav, Ondrej and Kapyla, Jani and Häger, Gustav and Yan, Song and Yang, Jinyu and Zhang, Zhongqun and Fernandez, Gustavo and Abdelpakey, Mohamed and Bhat, Goutam and Cerkezi, Llukman and Cevikalp, Hakan and Chen, Shengyong and Chen, Xin and Cheng, Miao and Cheng, Ziyi and Chiu, Yu-Chen and Cirakman, Ozgun and Cui, Yutao and Dai, Kenan and Dasari, Mohana Murali and Deng, Qili and Dong, Xingping and Du, Daniel K. and Dunnhofer, Matteo and Feng, Zhen-Hua and Feng, Zhiyong and Fu, Zhihong and Ge, Shiming and Gorthi, Rama Krishna and Gu, Yuzhang and Gunsel, Bilge and Guo, Qing and Gurkan, Filiz and Han, Wencheng and Huang, Yanyan and Järemo-Lawin, Felix and Jhang, Shang-Jhih and Ji, Rongrong and Jiang, Cheng and Jiang, Yingjie and Juefei-Xu, Felix and Jun, Yin and Ke, Xiao and Khan, Fahad Shahbaz and Kim, Byeong Hak and Kittler, Josef and Lan, Xiangyuan and Lee, Jun Ha and Leibe, Bastian and Li, Hui and Li, Jianhua and Li, Xianxian and Li, Yuezhou and Liu, Bo and Liu, Chang and Liu, Jingen and Liu, Li and Liu, Qingjie and Lu, Huchuan and Lu, Wei and Luiten, Jonathon and Ma, Jie and Ma, Ziang and Martinel, Niki and Mayer, Christoph and Memarmoghadam, Alireza and Micheloni, Christian and Niu, Yuzhen and Paudel, Danda and Peng, Houwen and Qiu, Shoumeng and Rajiv, Aravindh and Rana, Muhammad and Robinson, Andreas and Saribas, Hasan and Shao, Ling and Shehata, Mohamed and Shen, Furao and Shen, Jianbing and Simonato, Kristian and Song, Xiaoning and Tang, Zhangyong and Timofte, Radu and Torr, Philip and Tsai, Chi-Yi and Uzun, Bedirhan and Van Gool, Luc and Voigtlaender, Paul and Wang, Dong and Wang, Guangting and Wang, Liangliang and Wang, Lijun and Wang, Limin and Wang, Linyuan and Wang, Yong and Wang, Yunhong and Wu, Chenyan and Wu, Gangshan and Wu, Xiao-Jun and Xie, Fei and Xu, Tianyang and Xu, Xiang and Xue, Wanli and Yan, Bin and Yang, Wankou and Yang, Xiaoyun and Ye, Yu and Yin, Jun and Zhang, Chengwei and Zhang, Chunhui and Zhang, Haitao and Zhang, Kaihua and Zhang, Kangkai and Zhang, Xiaohan and Zhang, Xiaolin and Zhang, Xinyu and Zhang, Zhibin and Zhao, Shaochuan and Zhen, Ming and Zhong, Bineng and Zhu, Jiawen and Zhu, Xue-Feng},
title = {{The Ninth Visual Object Tracking VOT2021 Challenge Results}},
booktitle = {2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021)},
year = {2021},
series = {IEEE International Conference on Computer Vision Workshops},
pages = {2711--2738},
publisher = {IEEE COMPUTER SOC},
}
Solving geometric tasks involving point clouds by using machine learning is a challenging problem. Standard feed-forward neural networks combine linear or, if the bias parameter is included, affine layers and activation functions. Their geometric modeling is limited, which motivated the prior work introducing the multilayer hypersphere perceptron (MLHP). Its constituent part, i.e., the hypersphere neuron, is obtained by applying a conformal embedding of Euclidean space. By virtue of Clifford algebra, it can be implemented as the Cartesian dot product of inputs and weights. If the embedding is applied in a manner consistent with the dimensionality of the input space geometry, the decision surfaces of the model units become combinations of hyperspheres and make the decision-making process geometrically interpretable for humans. Our extension of the MLHP model, the multilayer geometric perceptron (MLGP), and its respective layer units, i.e., geometric neurons, are consistent with the 3D geometry and provide a geometric handle of the learned coefficients. In particular, the geometric neuron activations are isometric in 3D, which is necessary for rotation and translation equivariance. When classifying the 3D Tetris shapes, we quantitatively show that our model requires no activation function in the hidden layers other than the embedding to outperform the vanilla multilayer perceptron. In the presence of noise in the data, our model is also superior to the MLHP.
@inproceedings{diva2:1641525,
author = {Melnyk, Pavlo and Felsberg, Michael and Wadenbäck, Mårten},
title = {{Embed Me If You Can: A Geometric Perceptron}},
booktitle = {Proceedings 2021 IEEE/CVF International Conference on Computer Vision ICCV 2021},
year = {2021},
series = {IEEE International Conference on Computer Vision. Proceedings},
pages = {1256--1264},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called Open World Object Detection, where a model is tasked to: 1) identify objects that have not been introduced to it as unknown, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy-based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.
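As a hedged illustration of the energy-based ingredient only (the full ORE pipeline additionally relies on contrastive clustering and learned energy distributions), a free-energy score over the known-class logits can be computed as below; the temperature and threshold are hypothetical:

import torch

def energy_score(logits, temperature=1.0):
    # Free energy of the known-class logits: E(x) = -T * logsumexp(logits / T).
    # Low energy suggests a confident known-class detection; high energy
    # suggests the instance may be unknown.
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

logits = torch.randn(4, 20)                  # 4 detections, 20 known classes
unknown = energy_score(logits) > -1.0        # hypothetical threshold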
@inproceedings{diva2:1637943,
author = {Joseph, K. J. and Khan, Salman and Khan, Fahad and Balasubramanian, Vineeth N.},
title = {{Towards Open World Object Detection}},
booktitle = {2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021},
year = {2021},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {5826--5836},
publisher = {IEEE COMPUTER SOC},
}
We investigate spline-based continuous-time pose trajectory estimation using non-linear explicit motion priors. Current regularization priors either linearize the orientation, rely on the implicit regularization obtained from the chosen spline basis function, or use sampling-based regularization schemes. The latter is a special case of a Riemann sum approximation, and we demonstrate when and why this can fail, and propose a way to avoid these issues. In addition, we provide a number of novel, practically useful theoretical contributions, including requirements on knot spacing for orientation splines, new basis functions for constant velocity extrapolation, and a generalization of the popular P-Spline penalty to orientation. We analyze the properties of the proposed approach using synthetic data. We validate our system on the standard task of visual-inertial calibration, and apply it to stereo visual odometry, where we demonstrate real-time performance on KITTI.
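For reference, the classical Euclidean P-Spline penalty on spline control points is a sum of squared finite differences of a chosen order; a second-order version is written out below in LaTeX. The paper's contribution generalizes such a penalty to orientation-valued control points, which is not captured by this Euclidean sketch.

% Second-order P-spline difference penalty on control points c_i (Euclidean sketch)
P(\mathbf{c}) = \lambda \sum_{i} \left\| \Delta^{2} \mathbf{c}_{i} \right\|^{2},
\qquad
\Delta^{2} \mathbf{c}_{i} = \mathbf{c}_{i+1} - 2\,\mathbf{c}_{i} + \mathbf{c}_{i-1}.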
@inproceedings{diva2:1635574,
author = {Persson, Mikael and Häger, Gustav and Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik},
title = {{Practical Pose Trajectory Splines With Explicit Regularization}},
booktitle = {2021 INTERNATIONAL CONFERENCE ON 3D VISION (3DV 2021)},
year = {2021},
series = {International Conference on 3D Vision},
pages = {156--165},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Safe robot navigation in a dynamic environment requires the trajectories of each independently moving object (IMO). We present the novel and effective system Sequential Hierarchical Ransac Estimation (Shire), designed for this purpose. The system uses a stereo camera stream to find the objects and trajectories in real time. Shire detects moving objects using geometric consistency and finds their trajectories using bundle adjustment. Relying on geometric consistency allows the system to handle objects regardless of semantic class, unlike approaches based on semantic segmentation. Most Visual Odometry (VO) systems are inherently limited to a single motion by the choice of tracker. This limitation allows for efficient and robust ego-motion estimation in real time, but precludes tracking the multiple motions sought. Shire instead uses a generic tracker and achieves accurate VO and IMO estimates using track analysis. This removes the restriction to a single motion while retaining the real-time performance required for live navigation. We evaluate the system by bounding box intersection over union and ID persistence on a public dataset, collected from an autonomous test vehicle driving in real traffic. We also show the velocities of estimated IMOs. We investigate variations of the system that provide trade-offs between accuracy, performance and limitations.
@inproceedings{diva2:1602445,
author = {Persson, Mikael and Forss\'{e}n, Per-Erik},
title = {{Independently Moving Object Trajectories from Sequential Hierarchical Ransac}},
booktitle = {VISAPP: PROCEEDINGS OF THE 16TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS - VOL. 5: VISAPP},
year = {2021},
pages = {722--731},
publisher = {SCITEPRESS},
}
We investigate a novel deep-learning-based approach to estimate uncertainty in stereo disparity prediction networks. Current state-of-the-art methods often formulate disparity prediction as a regression problem with a single scalar output in each pixel. This can be problematic in practical applications, as in many cases there might not exist a single well-defined disparity, for example in cases of occlusions or at depth boundaries. While current neural-network-based disparity estimation approaches obtain good performance on benchmarks, the disparity prediction is treated as a black box at inference time. In this paper we show that by formulating the learning problem as a regression with a distribution target, we obtain a robust estimate of the uncertainty in each pixel, while maintaining the performance of the original method. The proposed method is evaluated both on a large-scale standard benchmark, as well as on our own data. We also show that the uncertainty estimate improves significantly by maximizing the uncertainty in those pixels that have no well-defined disparity during learning.
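A minimal sketch of the "regression with a distribution target" formulation, assuming the disparity range is discretized into bins (the bin layout, target construction and loss weighting here are hypothetical):

import torch
import torch.nn.functional as F

def distribution_loss(logits, target_dist):
    # logits: (B, K, H, W) per-pixel scores over K disparity bins
    # target_dist: (B, K, H, W) per-pixel target distribution (sums to 1 over K)
    log_p = F.log_softmax(logits, dim=1)
    return -(target_dist * log_p).sum(dim=1).mean()

def disparity_and_uncertainty(logits, disp_values):
    # Expected disparity and an entropy-based per-pixel uncertainty.
    p = torch.softmax(logits, dim=1)
    disp = (p * disp_values.view(1, -1, 1, 1)).sum(dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)
    return disp, entropy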
@inproceedings{diva2:1599550,
author = {Häger, Gustav and Persson, Mikael and Felsberg, Michael},
title = {{Predicting Disparity Distributions}},
booktitle = {2021 IEEE International Conference on Robotics and Automation (ICRA)},
year = {2021},
series = {IEEE International Conference on Robotics and Automation (ICRA)},
publisher = {IEEE},
}
Visual instance segmentation is a challenging problem, and it becomes even more difficult if the objects of interest vary unconstrained in shape. Some objects are well described by a rectangle; however, this is hardly always the case. Consider, for instance, long, slender objects such as ropes. Anchor-based approaches classify predefined bounding boxes as either negative or positive and thus provide a limited set of shapes that can be handled. Defining anchor-boxes that fit well to all possible shapes leads to an infeasible number of prior boxes. We explore a different approach and propose to train a neural network to compute distance maps along different directions. The network is trained at each pixel to predict the distance to the closest object contour in a given direction. By pooling the distance maps we obtain an approximation to the signed distance function (SDF). The SDF may then be thresholded in order to obtain a foreground-background segmentation. We compare this segmentation to foreground segmentations obtained from the state-of-the-art instance segmentation method YOLACT. On the COCO dataset, our segmentation yields a higher performance in terms of foreground intersection over union (IoU). However, while the distance maps contain information on the individual instances, it is not straightforward to map them to the full instance segmentation. We still believe that this idea is a promising research direction for instance segmentation, as it better captures the different shapes found in the real world.
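Purely as an illustration of how per-direction contour distances can be pooled into a single distance map and thresholded into a foreground mask (the paper's exact pooling and sign convention for the signed variant may differ):

import torch

def pooled_distance_mask(direction_distances, threshold=0.0):
    # direction_distances: (B, N, H, W), predicted distance to the closest
    # object contour along each of N directions.
    # The minimum over directions approximates the distance to the nearest
    # contour; thresholding gives a foreground-background mask. Sketch only.
    pooled = direction_distances.min(dim=1).values   # (B, H, W)
    return pooled, pooled > threshold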
@inproceedings{diva2:1594795,
author = {Brissman, Emil and Johnander, Joakim and Felsberg, Michael},
title = {{Predicting Signed Distance Functions for Visual Instance Segmentation}},
booktitle = {33rd Annual Workshop of the Swedish-Artificial-Intelligence-Society (SAIS)},
year = {2021},
series = {Annual Workshop of the Swedish-Artificial-Intelligence-Society (SAIS)},
pages = {5--10},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
In this paper we present a novel algorithm for onboard radial distortion correction for unmanned aerial vehicles (UAVs) equipped with an inertial measurement unit (IMU), which runs in real time. This approach makes calibration procedures redundant, thus allowing for exchange of optics extemporaneously. By utilizing the IMU data, the cameras can be aligned with the gravity direction. This allows us to work with fewer degrees of freedom, and opens up for further intrinsic calibration. We propose a fast and robust minimal solver for simultaneously estimating the focal length, radial distortion profile and motion parameters from homographies. The proposed solver is tested on both synthetic and real data, and performs better than or on par with state-of-the-art methods relying on pre-calibration procedures. Code available at: https://github.com/marcusvaltonen/HomLib.
@inproceedings{diva2:1594706,
author = {Valtonen Örnhag, Marcus and Persson, Patrik and Wadenbäck, Mårten and Åström, Kalle and Heyden, Anders},
title = {{Efficient Real-Time Radial Distortion Correction for UAVs}},
booktitle = {2021 IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2021},
series = {IEEE Winter Conference on Applications of Computer Vision (WACV)},
pages = {1750--1759},
}
In this paper we consider a collection of relative pose problems which arise naturally in applications for visual indoor navigation using unmanned aerial vehicles (UAVs). We focus on cases where additional information from an onboard IMU is available and thus provides a partial extrinsic calibration through the gravitational vector. The solvers are designed for a partially calibrated camera, for a variety of realistic indoor scenarios, which makes it possible to navigate using images of the ground floor. Current state-of-the-art solvers use more general assumptions, such as using arbitrary planar structures; however, these solvers do not yield adequate reconstructions for real scenes, nor do they perform fast enough to be incorporated in real-time systems. We show that the proposed solvers enjoy better numerical stability, are faster, and require fewer point correspondences, compared to state-of-the-art approaches. These properties are vital components for robust navigation in real-time systems, and we demonstrate on both synthetic and real data that our method outperforms other solvers, and yields superior motion estimation.
@inproceedings{diva2:1589920,
author = {Örnhag, Marcus Valtonen and Persson, Patrik and Wadenbäck, Mårten and Åström, Kalle and Heyden, Anders},
title = {{Minimal Solvers for Indoor UAV Positioning}},
booktitle = {2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2021},
series = {International Conference on Pattern Recognition},
pages = {1136--1143},
publisher = {IEEE COMPUTER SOC},
}
In this paper, we present a state-of-the-art reinforcement learning method for autonomous driving. Our approach employs temporal difference learning in a Bayesian framework to learn vehicle control signals from sensor data. The agent has access to images from a forward-facing camera, which are pre-processed to generate semantic segmentation maps. We trained our system using both ground truth and estimated semantic segmentation input. Based on our observations from a large set of experiments, we conclude that training the system on ground truth input data leads to better performance than training the system on estimated input, even if estimated input is used for evaluation. The system is trained and evaluated in a realistic simulated urban environment using the CARLA simulator. The simulator also contains a benchmark that allows for comparison with other systems and methods. The required training time of the system is shown to be lower and the performance on the benchmark superior to competing approaches.
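For readers unfamiliar with the temporal-difference component, a textbook TD(0) value update is sketched below; this is a generic illustration with hypothetical states, not the paper's Bayesian formulation.

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    # Generic TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

V = {"s0": 0.0, "s1": 0.0}          # hypothetical value table
td0_update(V, "s0", reward=1.0, next_state="s1")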
@inproceedings{diva2:1589919,
author = {Gharaee, Zahra and Holmquist, Karl and He, Linbo and Felsberg, Michael},
title = {{A Bayesian Approach to Reinforcement Learning of Vision-Based Vehicular Control}},
booktitle = {2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2021},
series = {International Conference on Pattern Recognition},
pages = {3947--3954},
publisher = {IEEE COMPUTER SOC},
}
Optical flow is a regression task where convolutional neural networks (CNNs) have led to major breakthroughs. However, this comes at major computational demands due to the use of cost volumes and pyramidal representations. This was mitigated by producing flow predictions at a quarter of the resolution, which are upsampled using bilinear interpolation at test time. Consequently, fine details are usually lost and post-processing is needed to restore them. We propose the Normalized Convolution UPsampler (NCUP), an efficient joint upsampling approach to produce the full-resolution flow during the training of optical flow CNNs. Our proposed approach formulates the upsampling task as a sparse problem and employs normalized convolutional neural networks to solve it. We evaluate our upsampler against existing joint upsampling approaches when trained end-to-end with a coarse-to-fine optical flow CNN (PWCNet) and show that it outperforms all other approaches on the FlyingChairs dataset while having at least one order of magnitude fewer parameters. Moreover, we test our upsampler with a recurrent optical flow CNN (RAFT) and achieve state-of-the-art results on the Sintel benchmark with ∼6% error reduction, and on-par results on the KITTI dataset, while having 7.5% fewer parameters. Finally, our upsampler shows better generalization capabilities than RAFT when trained and evaluated on different datasets.
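The normalized convolution underlying the upsampler treats the input as sparse data accompanied by a confidence map; a minimal sketch of the standard normalized-convolution formula (not the full NCUP module, and the confidence propagation rule is one common choice) is:

import torch
import torch.nn.functional as F

def normalized_convolution(x, conf, weight, eps=1e-8):
    # x, conf: (B, 1, H, W) sparse data and its confidence;
    # weight: (1, 1, k, k) non-negative applicability kernel.
    # Filter the confidence-weighted data and divide by the filtered confidence.
    pad = weight.shape[-1] // 2
    num = F.conv2d(x * conf, weight, padding=pad)
    den = F.conv2d(conf, weight, padding=pad)
    out = num / (den + eps)
    new_conf = den / weight.sum()    # propagated confidence (one common choice)
    return out, new_conf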
@inproceedings{diva2:1557589,
author = {Eldesokey, Abdelrahman and Felsberg, Michael},
title = {{Normalized Convolution Upsampling for Refined Optical Flow Estimation}},
booktitle = {Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications},
year = {2021},
series = {VISIGRAPP},
pages = {742--752},
publisher = {SciTePress},
}
Semi-supervised video object segmentation is a challenging task that aims to segment a target throughout a video sequence given an initial mask at the first frame. Discriminative approaches have demonstrated competitive performance on this task at a sensible complexity. These approaches typically formulate the problem as a one-versus-one classification between the target and the background. However, in reality, a video sequence usually encompasses a target, background, and possibly other distracting objects. Those objects increase the risk of introducing false positives, especially if they share visual similarities with the target. Therefore, it is more effective to separate distractors from the background, and handle them independently.
We propose a one-versus-many scheme to address this situation by separating distractors into their own class. This separation allows imposing special attention to challenging regions that are most likely to degrade the performance. We demonstrate the prominence of this formulation by modifying the learning-what-to-learn method to be distractor-aware. Our proposed approach sets a new state-of-the-art on the DAVIS val dataset, and improves over the baseline on the DAVIS test-dev benchmark by 4.8 percentage points.
@inproceedings{diva2:1545384,
author = {Robinson, Andreas and Eldesokey, Abdelrahman and Felsberg, Michael},
title = {{Distractor-aware video object segmentation}},
booktitle = {Pattern Recognition. DAGM GCPR 2021},
year = {2021},
series = {Lecture Notes in Computer Science},
volume = {13024},
pages = {222--234},
}
The Visual Object Tracking challenge VOT2020 is the eighth annual tracker benchmarking activity organized by the VOT initiative. Results of 58 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2020 challenge was composed of five sub-challenges focusing on different tracking domains: (i) VOT-ST2020 challenge focused on short-term tracking in RGB, (ii) VOT-RT2020 challenge focused on “real-time” short-term tracking in RGB, (iii) VOT-LT2020 focused on long-term tracking, namely coping with target disappearance and reappearance, (iv) VOT-RGBT2020 challenge focused on short-term tracking in RGB and thermal imagery and (v) VOT-RGBD2020 challenge focused on long-term tracking in RGB and depth imagery. Only the VOT-ST2020 datasets were refreshed. A significant novelty is the introduction of a new VOT short-term tracking evaluation methodology, and the introduction of segmentation ground truth in the VOT-ST2020 challenge – bounding boxes will no longer be used in the VOT-ST challenges. A new VOT Python toolkit that implements all these novelties was introduced. Performance of the tested trackers typically exceeds standard baselines by far. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).
@inproceedings{diva2:1599875,
author = {Kristan, M. and Leonardis, A. and Matas, J. and Felsberg, Michael and Pflugfelder, R. and Kämäräinen, J.-K. and Danelljan, M. and Zajc, L.C. and Lukežic, A. and Drbohlav, O. and He, Linbo and Zhang, Yushan and Yan, S. and Yang, J. and Fernández, G. and Hauptmann, A. and Memarmoghadam, A. and García-Martín, Á. and Robinson, Andreas and Varfolomieiev, A. and Gebrehiwot, A.H. and Uzun, B. and Yan, B. and Li, B. and Qian, C. and Tsai, C.-Y. and Micheloni, C. and Wang, D. and Wang, F. and Xie, F. and Järemo-Lawin, Felix and Gustafsson, F. and Foresti, G.L. and Bhat, G. and Chen, G. and Ling, H. and Zhang, H. and Cevikalp, H. and Zhao, H. and Bai, H. and Kuchibhotla, H.C. and Saribas, H. and Fan, H. and Ghanei-Yakhdan, H. and Li, H. and Peng, H. and Lu, H. and Li, H. and Khaghani, J. and Bescos, J. and Li, J. and Fu, J. and Yu, J. and Xu, J. and Kittler, J. and Yin, J. and Lee, J. and Yu, K. and Liu, K. and Yang, K. and Dai, K. and Cheng, L. and Zhang, L. and Wang, L. and Wang, L. and Van, Gool L. and Bertinetto, L. and Dunnhofer, M. and Cheng, M. and Dasari, M.M. and Wang, N. and Wang, N. and Zhang, P. and Torr, P.H.S. and Wang, Q. and Timofte, R. and Gorthi, R.K.S. and Choi, S. and Marvasti-Zadeh, S.M. and Zhao, S. and Kasaei, S. and Qiu, S. and Chen, S. and Schön, T.B. and Xu, T. and Lu, W. and Hu, W. and Zhou, W. and Qiu, X. and Ke, X. and Wu, X.-J. and Zhang, X. and Yang, X. and Zhu, X. and Jiang, Y. and Wang, Y. and Chen, Y. and Ye, Y. and Li, Y. and Yao, Y. and Lee, Y. and Gu, Y. and Wang, Z. and Tang, Z. and Feng, Z.-H. and Mai, Z. and Zhang, Z. and Wu, Z. and Ma, Z.},
title = {{The Eighth Visual Object Tracking VOT2020 Challenge Results}},
booktitle = {Computer Vision},
year = {2020},
series = {Lecture Notes in Computer Science},
volume = {12539},
pages = {547--601},
}
Unsupervised learning of anomaly detection in high-dimensional data, such as images, is a challenging problem that has recently been subject to intense research. Through careful modelling of the data distribution of normal samples, it is possible to detect deviant samples, so-called anomalies. Generative Adversarial Networks (GANs) can model the highly complex, high-dimensional data distribution of normal image samples, and have been shown to be a suitable approach to the problem. Previously published GAN-based anomaly detection methods often assume that anomaly-free data is available for training. However, this assumption is not valid in most real-life scenarios, a.k.a. in the wild. In this work, we evaluate the effects of anomaly contamination in the training data on state-of-the-art GAN-based anomaly detection methods. As expected, detection performance deteriorates. To address this performance drop, we propose to add an additional encoder network already at training time and show that joint generator-encoder training stratifies the latent space, mitigating the problem with contaminated data. We show experimentally that the norm of a query image in this stratified latent space becomes a highly significant cue to discriminate anomalies from normal data. The proposed method achieves state-of-the-art performance on CIFAR-10 as well as on a large, previously untested dataset with cell images.
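As a hedged sketch of the scoring rule suggested above, where the norm of an encoded query image in the stratified latent space serves as the anomaly cue; the encoder stands for the jointly trained network and is a placeholder here:

import torch

def latent_norm_score(encoder, images):
    # Anomaly score = norm of the encoded query in latent space; larger norms
    # are taken to indicate more anomalous samples (illustrative sketch).
    with torch.no_grad():
        z = encoder(images)                # (B, latent_dim)
        return z.flatten(1).norm(dim=1)    # (B,) anomaly scores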
@inproceedings{diva2:1539624,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Unsupervised Adversarial Learning of Anomaly Detection in the Wild}},
booktitle = {Proceedings of the 24th European Conference on Artificial Intelligence (ECAI)},
year = {2020},
series = {Frontiers in Artificial Intelligence and Applications},
volume = {325},
pages = {1002--1008},
publisher = {IOS Press},
address = {Amsterdam},
}
Probabilistic methods for point set registration have interesting theoretical properties, such as linear complexity in the number of used points, and they easily generalize to joint registration of multiple point sets. In this work, we improve their registration performance to match the state of the art. This is done by incorporating learned features, by adding a von Mises-Fisher feature model in each mixture component, and by using learned attention weights. We learn these jointly using a registration loss learning strategy (RLL) that directly uses the registration error as a loss, by back-propagating through the registration iterations. This is possible as the probabilistic registration is fully differentiable, and the result is a learning framework that is truly end-to-end. We perform extensive experiments on the 3DMatch and KITTI datasets. The experiments demonstrate that our approach benefits significantly from the integration of the learned features and our learning strategy, outperforming the state-of-the-art on KITTI. Code is available at https://github.com/felja633/RLLReg.
@inproceedings{diva2:1530341,
author = {Järemo-Lawin, Felix and Forss\'{e}n, Per-Erik},
title = {{Registration Loss Learning for Deep Probabilistic Point Set Registration}},
booktitle = {2020 International Conference on 3D Vision (3DV)},
year = {2020},
series = {International Conference on 3D Vision},
pages = {563--572},
publisher = {IEEE},
}
Recent innovations in microelectronic and semiconductor technology enable the creation of smaller and more economical hyperspectral cameras. A filter-equipped camera combined with an advanced scanning module is a game changer that extends the application of miniature hyperspectral imagers to many security domains. This work presents an assessment of the imager L4 from Glana Sensors for detecting concealed targets in woodland areas. Several target detection methods were applied to a collection of scenes acquired under various illumination conditions and containing different materials. The potential and limitations of this new imaging device in the context of difficult target detection in forested areas are evaluated and discussed.
@inproceedings{diva2:1470528,
author = {Gonzalez, Santiago A. Rodriguez and Shimoni, Michal and Plaza, Javier and Plaza, Antonio and Renhorn, Ingmar and Ahlberg, Jörgen},
title = {{The Detection of Concealed Targets in Woodland Areas using Hyperspectral Imagery}},
booktitle = {2020 IEEE Latin American GRSS \& ISPRS Remote Sensing Conference (LAGIRS)},
year = {2020},
pages = {451--455},
publisher = {IEEE},
address = {Santiago, Chile},
}
The focus in deep learning research has mostly been to push the limits of prediction accuracy. However, this was often achieved at the cost of increased complexity, raising concerns about the interpretability and the reliability of deep networks. Recently, increasing attention has been given to untangling the complexity of deep networks and quantifying their uncertainty for different computer vision tasks. In contrast, the task of depth completion has not received enough attention despite the inherently noisy nature of depth sensors. In this work, we therefore focus on modeling the uncertainty of depth data in depth completion starting from the sparse noisy input all the way to the final prediction. We propose a novel approach to identify disturbed measurements in the input by learning an input confidence estimator in a self-supervised manner based on normalized convolutional neural networks (NCNNs). Further, we propose a probabilistic version of NCNNs that produces a statistically meaningful uncertainty measure for the final prediction. When we evaluate our approach on the KITTI dataset for depth completion, we outperform all the existing Bayesian deep learning approaches in terms of prediction accuracy, quality of the uncertainty measure, and computational efficiency. Moreover, our small network with 670k parameters performs on par with conventional approaches with millions of parameters. These results give strong evidence that separating the network into parallel uncertainty and prediction streams leads to state-of-the-art performance with accurate uncertainty estimates.
@inproceedings{diva2:1465186,
author = {Eldesokey, Abdelrahman and Felsberg, Michael and Holmquist, Karl and Persson, Mikael},
title = {{Uncertainty-Aware CNNs for Depth Completion: Uncertainty from Beginning to End}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {12011--12020},
publisher = {IEEE},
}
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking.
@inproceedings{diva2:1462283,
author = {Bhat, Goutam and Järemo-Lawin, Felix and Danelljan, Martin and Robinson, Andreas and Felsberg, Michael and Van Gool, Luc and Timofte, Radu},
title = {{Learning What to Learn for Video Object Segmentation}},
booktitle = {Computer Vision},
year = {2020},
series = {Lecture Notes in Computer Science},
volume = {12347},
pages = {777--794},
}
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time. The main difficulty is to effectively handle appearance changes and similar background objects, while maintaining accurate segmentation. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting. More recent methods integrate generative target appearance models, but either achieve limited robustness or require large amounts of training data. We propose a novel VOS architecture consisting of two network components. The target appearance model consists of a light-weight module, which is learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks. Our method is fast, easily trainable and remains highly effective in cases of limited training data. We perform extensive experiments on the challenging YouTube-VOS and DAVIS datasets. Our network achieves favorable performance, while operating at higher frame-rates compared to state-of-the-art. Code and trained models are available at https://github.com/andr345/frtm-vos.
@inproceedings{diva2:1458627,
author = {Robinson, Andreas and Järemo-Lawin, Felix and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Learning Fast and Robust Target Models for Video Object Segmentation}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {7404--7413},
publisher = {IEEE},
}
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems, e.g., for classification, segmentation and object detection. The vulnerability of DNNs against such attacks can prove a major roadblock towards their real-world deployment. Transferability of adversarial examples demands generalizable defenses that can provide cross-task protection. Adversarial training that enhances robustness by modifying the target model's parameters lacks such generalizability. On the other hand, different input-processing based defenses fall short in the face of continuously evolving attacks. In this paper, we take the first step to combine the benefits of both approaches and propose a self-supervised adversarial training mechanism in the input space. By design, our defense is a generalizable approach and provides significant robustness against unseen adversarial attacks (e.g., by reducing the success rate of the translation-invariant ensemble attack from 82.6% to 31.9% in comparison to previous state-of-the-art). It can be deployed as a plug-and-play solution to protect a variety of vision systems, as we demonstrate for the case of classification, segmentation and detection.
@inproceedings{diva2:1458576,
author = {Naseer, M. and Khan, S. and Hayat, M. and Khan, Fahad Shahbaz and Porikli, F.},
title = {{A Self-supervised Approach for Adversarial Robustness}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {259--268},
publisher = {IEEE},
}
Understanding interactions between humans and objects is one of the fundamental problems in visual classification and an essential step towards detailed scene understanding. Human-object interaction (HOI) detection strives to localize both the human and an object as well as to identify the complex interactions between them. Most existing HOI detection approaches are instance-centric, where interactions between all possible human-object pairs are predicted based on appearance features and coarse spatial information. We argue that appearance features alone are insufficient to capture complex human-object interactions. In this paper, we therefore propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs. Our network predicts interaction points, which directly localize and classify the interaction. Paired with the densely predicted interaction vectors, the interactions are associated with human and object detections to obtain final predictions. To the best of our knowledge, we are the first to propose an approach where HOI detection is posed as a keypoint detection and grouping problem. Experiments are performed on two popular benchmarks: V-COCO and HICO-DET. Our approach sets a new state-of-the-art on both datasets. Code is available at https://github.com/vaesl/IP-Net.
@inproceedings{diva2:1458572,
author = {Wang, T. and Yang, T. and Danelljan, M. and Khan, Fahad Shahbaz and Zhang, X. and Sun, J.},
title = {{Learning Human-Object Interaction Detection Using Interaction Points}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {4115--4124},
publisher = {IEEE},
}
One of the attractive characteristics of deep neural networks is their ability to transfer knowledge obtained in one domain to other related domains. As a result, high-quality networks can be trained in domains with relatively little training data. This property has been extensively studied for discriminative networks but has received significantly less attention for generative models. Given the often enormous effort required to train GANs, both computationally as well as in the dataset collection, the re-use of pretrained GANs is a desirable objective. We propose a novel knowledge transfer method for generative models based on mining the knowledge that is most beneficial to a specific target domain, either from a single or multiple pretrained GANs. This is done using a miner network that identifies which part of the generative distribution of each pretrained GAN outputs samples closest to the target domain. Mining effectively steers GAN sampling towards suitable regions of the latent space, which facilitates the posterior finetuning and avoids pathologies of other methods such as mode collapse and lack of flexibility. We perform experiments on several complex datasets using various GAN architectures (BigGAN, Progressive GAN) and show that the proposed method, called MineGAN, effectively transfers knowledge to domains with few target images, outperforming existing methods. In addition, MineGAN can successfully transfer knowledge from multiple pretrained GANs. Our code is available at: https://github.com/yaxingwang/MineGAN.
@inproceedings{diva2:1458547,
author = {Wang, Y. and Gonzalez-Garcia, A. and Berga, D. and Herranz, L. and Khan, Fahad Shahbaz and Weijer, J. van de},
title = {{MineGAN: Effective Knowledge Transfer From GANs to Target Domains With Few Images}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {9329--9338},
publisher = {IEEE},
}
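To make the mining idea above concrete, here is a minimal sketch (not the published MineGAN code): a small miner network remaps noise into the latent input of a frozen pretrained generator and is trained with an ordinary GAN loss against the few available target-domain images. The network sizes, the flattened image representation, and the joint optimizer step are simplifying assumptions.

```python
# Hedged sketch of MineGAN-style mining, not the published code: a miner MLP maps
# noise z to a new latent code for a frozen pretrained generator, so that sampling
# is steered towards the target domain. Sizes and losses are illustrative assumptions.
import torch
import torch.nn as nn

z_dim, img_dim = 128, 3 * 32 * 32

miner = nn.Sequential(nn.Linear(z_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
pretrained_G = nn.Sequential(nn.Linear(z_dim, img_dim), nn.Tanh())   # stand-in generator
for p in pretrained_G.parameters():
    p.requires_grad_(False)                                          # generator stays frozen
D = nn.Linear(img_dim, 1)                                            # stand-in discriminator

opt = torch.optim.Adam(list(miner.parameters()) + list(D.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

target_images = torch.randn(16, img_dim)         # the few target-domain images (flattened)
z = torch.randn(16, z_dim)
fake = pretrained_G(miner(z))                     # sample through the mined latent space

# One illustrative update; in practice discriminator and miner are trained alternately.
loss_d = bce(D(target_images), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
loss_g = bce(D(fake), torch.ones(16, 1))
opt.zero_grad(); (loss_d + loss_g).backward(); opt.step()
print(float(loss_d), float(loss_g))
```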
In the last few years, unpaired image-to-image translation has witnessed remarkable progress. Although the latest methods are able to generate realistic images, they crucially rely on a large number of labeled images. Recently, some methods have tackled the challenging setting of few-shot image-to-image translation, reducing the labeled data requirements for the target domain during inference. In this work, we go one step further and reduce the amount of required labeled data also from the source domain during training. To do so, we propose applying semi-supervised learning via a noise-tolerant pseudo-labeling procedure. We also apply a cycle consistency constraint to further exploit the information from unlabeled images, either from the same dataset or external. Additionally, we propose several structural modifications to facilitate the image translation task under these circumstances. Our semi-supervised method for few-shot image translation, called SEMIT, achieves excellent results on four different datasets using as little as 10% of the source labels, and matches the performance of the main fully-supervised competitor using only 20% labeled data. Our code and models are made public at: https://github.com/yaxingwang/SEMIT.
@inproceedings{diva2:1458539,
author = {Wang, Y. and Khan, S. and Gonzalez-Garcia, A. and Weijer, J. van de and Khan, Fahad Shahbaz},
title = {{Semi-Supervised Learning for Few-Shot Image-to-Image Translation}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {4452--4461},
publisher = {IEEE},
}
Humans can continuously learn new knowledge as their experience grows. In contrast, previous learning in deep neural networks can quickly fade out when they are trained on a new task. In this paper, we hypothesize that this problem can be avoided by learning a set of generalized parameters that are neither specific to old nor new tasks. In this pursuit, we introduce a novel meta-learning approach that seeks to maintain an equilibrium between all the encountered tasks. This is ensured by a new meta-update rule which avoids catastrophic forgetting. In comparison to previous meta-learning techniques, our approach is task-agnostic. When presented with a continuum of data, our model automatically identifies the task and quickly adapts to it with just a single update. We perform extensive experiments on five datasets in a class-incremental setting, leading to significant improvements over state-of-the-art methods (e.g., a 21.3% boost on CIFAR100 with 10 incremental tasks). Specifically, on large-scale datasets that generally prove difficult cases for incremental learning, our approach delivers absolute gains as high as 19.1% and 7.4% on the ImageNet and MS-Celeb datasets, respectively.
@inproceedings{diva2:1458536,
author = {Rajasegaran, J. and Khan, S. and Hayat, M. and Khan, Fahad Shahbaz and Shah, M.},
title = {{iTAML: An Incremental Task-Agnostic Meta-learning Approach}},
booktitle = {2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
pages = {13585--13594},
publisher = {IEEE},
}
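As a rough illustration of a task-balancing meta-update, the sketch below adapts a model separately to each task seen so far and then moves the meta-parameters towards the average of the adapted weights. This is a Reptile-style stand-in used only for exposition; the actual iTAML meta-update rule differs in its details, and the model size, data, and learning rates are assumptions.

```python
# Hedged illustration of a task-balanced meta-update (Reptile-style stand-in, not
# the exact iTAML rule): adapt to each task, then move towards the average of the
# adapted parameters so no single task dominates. Sizes and rates are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

model, meta_lr = nn.Linear(8, 4), 0.5

def adapt_to_task(base, task, inner_steps=3, lr=0.1):
    clone = copy.deepcopy(base)
    opt = torch.optim.SGD(clone.parameters(), lr=lr)
    x, y = task
    for _ in range(inner_steps):
        loss = F.cross_entropy(clone(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return clone

tasks = [(torch.randn(32, 8), torch.randint(0, 4, (32,))) for _ in range(3)]
adapted = [adapt_to_task(model, t) for t in tasks]

with torch.no_grad():
    for name, p in model.named_parameters():
        mean_p = torch.stack([dict(a.named_parameters())[name] for a in adapted]).mean(0)
        p += meta_lr * (mean_p - p)          # equilibrium between all encountered tasks
print(sum(p.numel() for p in model.parameters()))
```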
This paper presents a method for detecting independently moving objects (IMOs) from a monocular camera mounted on a moving car. We use an existing state-of-the-art monocular sparse visual odometry/SLAM framework, and specifically attack the notorious problem of identifying those IMOs which move parallel to the ego-car motion, that is, in an 'epipolar-conformant' way. IMO candidate patches are obtained from an existing CNN-based car instance detector. While crossing IMOs can be identified as such by epipolar consistency checks, IMOs that move parallel to the camera motion are much harder to detect, as their epipolar conformity allows them to be misinterpreted as static objects at a wrong distance. We employ a CNN to provide an appearance-based depth estimate, and the ambiguity problem can be solved through depth verification. The obtained motion labels (IMO/static) are then propagated over time using the combination of motion cues and appearance-based information of the IMO candidate patches. We evaluate the performance of our method on the KITTI dataset.
@inproceedings{diva2:1514674,
author = {Fanani, Nolang and Ochs, Matthias and Mester, Rudolf},
title = {{Detecting Parallel-Moving Objects in the Monocular Case Employing CNN Depth Maps}},
booktitle = {COMPUTER VISION - ECCV 2018 WORKSHOPS, PT III},
year = {2019},
series = {Lecture Notes in Computer Science},
pages = {281--297},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
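The depth-verification step lends itself to a small numeric illustration. The sketch below, with made-up numbers and an assumed threshold, compares the depth that static-world triangulation would assign to an epipolar-conformant patch with a CNN-based appearance depth estimate; a large disagreement flags the patch as independently moving.

```python
# Hedged numeric sketch of depth verification; values and threshold are assumptions.
import numpy as np

def triangulated_depth(disparity_px, baseline_m, focal_px):
    """Depth implied by two-view triangulation under the static-world assumption."""
    return baseline_m * focal_px / max(disparity_px, 1e-6)

def classify_patch(static_depth_m, cnn_depth_m, rel_threshold=0.3):
    """Flag the patch as IMO when the two depth estimates disagree strongly."""
    rel_error = abs(static_depth_m - cnn_depth_m) / cnn_depth_m
    return "IMO" if rel_error > rel_threshold else "static"

# A car driving parallel ahead of the ego-vehicle shows almost no residual motion,
# so triangulation places it far away, while the appearance-based depth says it is close.
z_static = triangulated_depth(disparity_px=2.0, baseline_m=1.2, focal_px=720.0)
z_cnn = 25.0
print(round(z_static, 1), classify_patch(z_static, z_cnn))
```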
The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis as well as the standard VOT methodology for long-term tracking analysis. The VOT2019 challenge was composed of five challenges focusing on different tracking domains: (i) the VOT-ST2019 challenge focused on short-term tracking in RGB, (ii) the VOT-RT2019 challenge focused on "real-time" short-term tracking in RGB, and (iii) VOT-LT2019 focused on long-term tracking, namely coping with target disappearance and reappearance. Two new challenges have been introduced: (iv) the VOT-RGBT2019 challenge focused on short-term tracking in RGB and thermal imagery and (v) the VOT-RGBD2019 challenge focused on long-term tracking in RGB and depth imagery. The VOT-ST2019, VOT-RT2019 and VOT-LT2019 datasets were refreshed while new datasets were introduced for VOT-RGBT2019 and VOT-RGBD2019. The VOT toolkit has been updated to support both standard short-term, long-term tracking and tracking with multi-channel imagery. Performance of the tested trackers typically far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website.
@inproceedings{diva2:1466584,
author = {Kristanl, Matej and Matas, Jiri and Leonardis, Ales and Felsberg, Michael and Pflugfelder, Roman and Kamarainen, Joni-Kristian and Zajc, Luka Cehovin and Drbohlav, Ondrej and Lukezic, Alan and Berg, Amanda and Eldesokey, Abdelrahman and Kapyla, Jani and Fernandez, Gustavo and Gonzalez-Garcia, Abel and Memarrnoghadam, Alireza and Lu, Andong and He, Anfeng and Varfolomieiev, Anton and Chan, Antoni and Tripathi, Ardhendu Shekhar and Smeulders, Arnold and Pedasingu, Bala Suraj and Chen, Bao Xin and Zhang, Baopeng and Wu, Baoyuan and Li, Bi and He, Bin and Yan, Bin and Bai, Bing and Li, Bing and Li, Bo and Kim, Bycong Hak and Ma, Chao and Fang, Chen and Qian, Chen and Chen, Cheng and Li, Chenglong and Zhang, Chengquan and Tsai, Chi-Yi and Luo, Chong and Micheloni, Christian and Zhang, Chunhui and Tao, Dacheng and Gupta, Deepak and Song, Dejia and Wang, Dong and Gavves, Efstratios and Yi, Eunu and Khan, Fahad Shahbaz and Zhang, Fangyi and Wang, Fei and Zhao, Fei and De Ath, George and Bhat, Goutam and Chen, Guanqi and Wang, Guangting and Li, Guoxuan and Cevikalp, Hakan and Du, Hao and Zhao, Haojie and Saribas, Hasan and Jung, Ho Min and Bai, Hongliang and Yu, Hongyuan and Peng, Houwen and Lu, Huchuan and Li, Hui and Li, Jiakun and Li, Jianhu and Fu, Jianlong and Chen, Jie and Gao, Jie and Zhao, Jie and Tang, Jin and Li, Jing and Wu, Jingjing and Liu, Jingtuo and Wang, Jinqiao and Qi, Jingqing and Zhang, Jingyue and Tsotsos, John K. and Lee, John Hyuk and van de Weijer, Joost and Kittler, Josef and Lee, Jun Ha and Zhuang, Junfei and Zhang, Kangkai and wang, Kangkang and Dai, Kenan and Chen, Lei and Liu, Lei and Guo, Leida and Zhang, Li and Wang, Liang and Wang, Liangliang and Zhang, Lichao and Wang, Lijun and Zhou, Lijun and Zheng, Linyu and Rout, Litu and Van Gool, Luc and Bertinetto, Luca and Danelljan, Martin and Dunnhofer, Matteo and Ni, Meng and Kim, Min Young and Tang, Ming and Yang, Ming-Hsuan and Paluru, Naveen and Martine, Niki and Xu, Pengfei and Zhang, Pengfei and Zheng, Pengkun and Zhang, Pengyu and Torr, Philip H. S. and Wang, Qi Zhang Qiang and Gua, Qing and Timofte, Radu and Gorthi, Rama Krishna and Everson, Richard and Han, Ruize and Zhang, Ruohan and You, Shan and Zhao, Shao-Chuan and Zhao, Shengwei and Li, Shihu and Li, Shikun and Ge, Shiming and Bai, Shuai and Guan, Shuosen and Xing, Tengfei and Xu, Tianyang and Yang, Tianyu and Zhang, Ting and Vojir, Tomas and Feng, Wei and Hu, Weiming and Wang, Weizhao and Tang, Wenjie and Zeng, Wenjun and Liu, Wenyu and Chen, Xi and Qiu, Xi and Bai, Xiang and Wu, Xiao-Jun and Yang, Xiaoyun and Chen, Xier and Li, Xin and Sun, Xing and Chen, Xingyu and Tian, Xinmei and Tang, Xu and Zhu, Xue-Feng and Huang, Yan and Chen, Yanan and Lian, Yanchao and Gu, Yang and Liu, Yang and Chen, Yanjie and Zhang, Yi and Xu, Yinda and Wang, Yingming and Li, Yingping and Zhou, Yu and Dong, Yuan and Xu, Yufei and Zhang, Yunhua and Li, Yunkun and Luo, Zeyu Wang Zhao and Zhang, Zhaoliang and Feng, Zhen-Hua and He, Zhenyu and Song, Zhichao and Chen, Zhihao and Zhang, Zhipeng and Wu, Zhirong and Xiong, Zhiwei and Huang, Zhongjian and Teng, Zhu and Ni, Zihan},
title = {{The Seventh Visual Object Tracking VOT2019 Challenge Results}},
booktitle = {2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW)},
year = {2019},
series = {IEEE International Conference on Computer Vision Workshops},
pages = {2206--2241},
publisher = {IEEE COMPUTER SOC},
}
Convolutional neural networks (CNNs) have recently achieved outstanding results for various vision tasks, including indoor scene understanding. The de facto practice employed by state-of-the-art indoor scene recognition approaches is to use RGB pixel values as input to CNN models that are trained on large amounts of labeled data (ImageNet or Places). Here, we investigate CNN architectures by augmenting RGB images with estimated depth and texture information, as multiple streams, for monocular indoor scene recognition. First, we exploit the recent advancements in the field of depth estimation from monocular images and use the estimated depth information to train a CNN model for learning deep depth features. Second, we train a CNN model to exploit the successful Local Binary Patterns (LBP) by using mapped coded images with explicit LBP encoding to capture texture information available in indoor scenes. We further investigate different fusion strategies to combine the learned deep depth and texture streams with the traditional RGB stream. Comprehensive experiments are performed on three indoor scene classification benchmarks: MIT-67, OCIS and SUN-397. The proposed multi-stream network significantly outperforms the standard RGB network by achieving absolute gains of 9.3%, 4.7%, and 7.3% on the MIT-67, OCIS and SUN-397 datasets, respectively.
@inproceedings{diva2:1466223,
author = {Anwer, Rao Muhammad and Khan, Fahad and Laaksonen, Jorma and Zaki, Nazar},
title = {{Multi-stream Convolutional Networks for Indoor Scene Recognition}},
booktitle = {COMPUTER ANALYSIS OF IMAGES AND PATTERNS, CAIP 2019, PT I},
year = {2019},
series = {Lecture Notes in Computer Science},
pages = {196--208},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
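A minimal sketch of the multi-stream idea is given below: three small stand-in backbones process the RGB image, the estimated depth map, and the LBP-coded texture image, and their features are fused before a shared scene classifier. Concatenation is only one of several possible fusion strategies, and all layer sizes here are assumptions.

```python
# Hedged sketch of multi-stream late fusion for indoor scene recognition.
# Backbones are tiny stand-ins; real models would use large pretrained CNNs.
import torch
import torch.nn as nn

def tiny_stream(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

class MultiStreamSceneNet(nn.Module):
    def __init__(self, num_classes=67):
        super().__init__()
        self.rgb, self.depth, self.texture = tiny_stream(3), tiny_stream(1), tiny_stream(1)
        self.classifier = nn.Linear(3 * 16, num_classes)

    def forward(self, rgb, depth, lbp):
        # Late fusion by concatenating the per-stream feature vectors.
        fused = torch.cat([self.rgb(rgb), self.depth(depth), self.texture(lbp)], dim=1)
        return self.classifier(fused)

net = MultiStreamSceneNet()
logits = net(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224))
print(logits.shape)
```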
Common object counting in a natural scene is a challenging problem in computer vision with numerous real-world applications. Existing image-level supervised common object counting approaches only predict the global object count and rely on additional instance-level supervision to also determine object locations. We propose an image-level supervised approach that provides both the global object count and the spatial distribution of object instances by constructing an object category density map. Motivated by psychological studies, we further reduce image-level supervision using only limited object count information (up to four). To the best of our knowledge, we are the first to propose image-level supervised density map estimation for common object counting and demonstrate its effectiveness in image-level supervised instance segmentation. Comprehensive experiments are performed on the PASCAL VOC and COCO datasets. Our approach outperforms existing methods, including those using instance-level supervision, on both datasets for common object counting. Moreover, our approach improves state-of-the-art image-level supervised instance segmentation [34] with a relative gain of 17.8% in terms of average best overlap, on the PASCAL VOC 2012 dataset.
@inproceedings{diva2:1458518,
author = {Cholakkal, Hisham and Sun, Guolei and Khan, Fahad Shahbaz and Shao, Ling},
title = {{Object Counting and Instance Segmentation with Image-level Supervision}},
booktitle = {2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), Long Beach, CA, JUN 16-20, 2019},
year = {2019},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {12389--12397},
publisher = {IEEE},
}
Single-stage object detectors have recently gained popularity due to their combined advantage of high detection accuracy and real-time speed. However, while promising results have been achieved by these detectors on standard-sized objects, their performance on small objects is far from satisfactory. To detect very small/large objects, the classical pyramid representation can be exploited, where an image pyramid is used to build a feature pyramid (featurized image pyramid), enabling detection across a range of scales. Existing single-stage detectors avoid such a featurized image pyramid representation due to its memory and time complexity. In this paper, we introduce a light-weight architecture to efficiently produce a featurized image pyramid in a single-stage detection framework. The resulting multi-scale features are then injected into the prediction layers of the detector using an attention module. The performance of our detector is validated on two benchmarks: PASCAL VOC and MS COCO. For a 300 x 300 input, our detector operates at 111 frames per second (FPS) on a Titan X GPU, providing state-of-the-art detection accuracy on the PASCAL VOC 2007 test set. On the MS COCO test set, our detector achieves state-of-the-art results, surpassing all existing single-stage methods in the case of single-scale inference.
@inproceedings{diva2:1458515,
author = {Pang, Yanwei and Wang, Tiancai and Anwer, Rao Muhammad and Khan, Fahad Shahbaz and Shao, Ling},
title = {{Efficient Featurized Image Pyramid Network for Single Shot Detector}},
booktitle = {2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), Long Beach, CA, JUN 16-20, 2019},
year = {2019},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {7328--7336},
publisher = {IEEE},
}
Pedestrian detection relying on deep convolution neural networks has made significant progress. Though promising results have been achieved on standard pedestrians, the performance on heavily occluded pedestrians remains far from satisfactory. The main culprits are intra-class occlusions involving other pedestrians and inter-class occlusions caused by other objects, such as cars and bicycles. These result in a multitude of occlusion patterns. We propose an approach for occluded pedestrian detection with the following contributions. First, we introduce a novel mask-guided attention network that fits naturally into popular pedestrian detection pipelines. Our attention network emphasizes visible pedestrian regions while suppressing the occluded ones by modulating full body features. Second, we empirically demonstrate that coarse-level segmentation annotations provide a reasonable approximation to their dense pixel-wise counterparts. Experiments are performed on the CityPersons and Caltech datasets. Our approach sets a new state-of-the-art on both datasets. Our approach obtains an absolute gain of 9.5% in log-average miss rate, compared to the best reported results [31] on the heavily occluded HO pedestrian set of the CityPersons test set. Further, on the HO pedestrian set of the Caltech dataset, our method achieves an absolute gain of 5.0% in log-average miss rate, compared to the best reported results [13]. Code and models are available at: https://github.com/Leotju/MGAN.
@inproceedings{diva2:1458513,
author = {Pang, Yanwei and Xie, Jin and Khan, Muhammad Haris and Anwer, Rao Muhammad and Khan, Fahad Shahbaz and Shao, Ling},
title = {{Mask-Guided Attention Network for Occluded Pedestrian Detection}},
booktitle = {2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019)},
year = {2019},
series = {IEEE International Conference on Computer Vision},
pages = {4966--4974},
publisher = {IEEE COMPUTER SOC},
}
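The sketch below illustrates, under assumed channel counts and a much simplified branch design, how a mask-guided attention map can modulate full-body RoI features: a sigmoid attention map re-weights the features so that occluded regions are damped, while an auxiliary head is meant to be supervised with coarse visibility masks.

```python
# Hedged sketch of mask-guided attention modulation; not the published MGAN network.
# Channel counts, branch depth, and the auxiliary-mask supervision are assumptions.
import torch
import torch.nn as nn

class MaskGuidedAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1), nn.Sigmoid(),
        )
        # Auxiliary head that would be trained against coarse visible-region masks.
        self.mask_head = nn.Conv2d(channels, 1, 1)

    def forward(self, features):
        attention = self.att(features)          # (N, 1, H, W) values in [0, 1]
        modulated = features * attention        # occluded regions are suppressed
        mask_logits = self.mask_head(features)  # coarse visibility prediction
        return modulated, mask_logits

feats = torch.randn(2, 512, 14, 7)              # RoI features of two pedestrian proposals
out, mask_logits = MaskGuidedAttention()(feats)
print(out.shape, mask_logits.shape)
```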
Siamese approaches address the visual tracking problem by extracting an appearance template from the current frame, which is used to localize the target in the next frame. In general, this template is linearly combined with the accumulated template from the previous frame, resulting in an exponential decay of information over time. While such an approach to updating has led to improved results, its simplicity limits the potential gain likely to be obtained by learning to update. Therefore, we propose to replace the handcrafted update function with a method which learns to update. We use a convolutional neural network, called UpdateNet, which, given the initial template, the accumulated template and the template of the current frame, aims to estimate the optimal template for the next frame. The UpdateNet is compact and can easily be integrated into existing Siamese trackers. We demonstrate the generality of the proposed approach by applying it to two Siamese trackers, SiamFC and DaSiamRPN. Extensive experiments on the VOT2016, VOT2018, LaSOT, and TrackingNet datasets demonstrate that our UpdateNet effectively predicts the new target template, outperforming the standard linear update. On the large-scale TrackingNet dataset, our UpdateNet improves the results of DaSiamRPN with an absolute gain of 3.9% in terms of success score. Code and models are available at https://github.com/zhanglichao/updatenet.
@inproceedings{diva2:1458510,
author = {Zhang, Lichao and Gonzalez-Garcia, Abel and van de Weijer, Joost and Danelljan, Martin and Khan, Fahad Shahbaz},
title = {{Learning the Model Update for Siamese Trackers}},
booktitle = {2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), Seoul, SOUTH KOREA, OCT 27-NOV 02, 2019},
year = {2019},
series = {IEEE International Conference on Computer Vision},
pages = {4009--4018},
publisher = {IEEE COMPUTER SOC},
}
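To contrast the standard linear template update with a learned one, here is a minimal sketch. The running-average update and the tiny three-input network are illustrative stand-ins; the actual UpdateNet architecture, template sizes, and blending weight differ.

```python
# Hedged sketch: linear template update vs. a small learned update network, in the
# spirit of UpdateNet. Shapes, layer sizes, and gamma are illustrative assumptions.
import torch
import torch.nn as nn

def linear_update(acc_template, cur_template, gamma=0.01):
    """Standard running-average update: old information decays exponentially."""
    return (1.0 - gamma) * acc_template + gamma * cur_template

class SimpleUpdateNet(nn.Module):
    """Predicts the next template from (initial, accumulated, current) templates."""
    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, init_t, acc_t, cur_t):
        x = torch.cat([init_t, acc_t, cur_t], dim=1)
        # Residual connection to the initial (ground-truth) template.
        return self.net(x) + init_t

t0 = torch.randn(1, 256, 6, 6)    # initial template from the first frame
acc = t0.clone()                  # accumulated template
cur = torch.randn(1, 256, 6, 6)   # template extracted from the current frame
print(linear_update(acc, cur).shape)
print(SimpleUpdateNet()(t0, acc, cur).shape)
```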
Adversarial examples reveal the blind spots of deep neural networks (DNNs) and represent a major concern for security-critical applications. The transferability of adversarial examples makes real-world attacks possible in black-box settings, where the attacker is forbidden to access the internal parameters of the model. The underlying assumption in most adversary generation methods, whether learning an instance-specific or an instance-agnostic perturbation, is the direct or indirect reliance on the original domain-specific data distribution. In this work, for the first time, we demonstrate the existence of domain-invariant adversaries, thereby showing a common adversarial space among different datasets and models. To this end, we propose a framework capable of launching highly transferable attacks that crafts adversarial patterns to mislead networks trained on entirely different domains. For instance, an adversarial function learned on Paintings, Cartoons or Medical images can successfully perturb ImageNet samples to fool the classifier, with success rates as high as ~99% (ℓ∞ ≤ 10). The core of our proposed adversarial function is a generative network that is trained using a relativistic supervisory signal that enables domain-invariant perturbations. Our approach sets a new state-of-the-art for fooling rates, both under the white-box and black-box scenarios. Furthermore, despite being an instance-agnostic perturbation function, our attack outperforms the conventionally much stronger instance-specific attack methods.
@inproceedings{diva2:1454554,
author = {Naseer, Muzammal and Khan, Salman and Khan, Muhammad Haris and Khan, Fahad and Porikli, Fatih},
title = {{Cross-Domain Transferability of Adversarial Perturbations}},
booktitle = {ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019)},
year = {2019},
series = {Advances in Neural Information Processing Systems},
publisher = {NEURAL INFORMATION PROCESSING SYSTEMS (NIPS)},
}
In this work, we address the problem of semi-supervised video object segmentation, where the task is to segment a target object in every image of the video sequence, given a ground truth only in the first frame. To be successful, it is crucial to robustly handle unpredictable target appearance changes and distracting objects in the background. In this work we obtain a robust and efficient representation of the target by integrating a fast and light-weight discriminative target model into a deep segmentation network. Trained during inference, the target model learns to discriminate between the local appearances of target and background image regions. Its predictions are enhanced to accurate segmentation masks in a subsequent refinement stage. To further improve the segmentation performance, we add a new module trained to generate global target attention vectors, given the input mask and image feature maps. The attention vectors add semantic information about the target from a previous frame to the refinement stage, complementing the predictions provided by the target appearance model. Our method is fast and requires no network fine-tuning. We achieve a combined J and F-score of 70.6 on the DAVIS 2019 test-challenge data.
@inproceedings{diva2:1390580,
author = {Robinson, Andreas and Järemo-Lawin, Felix and Danelljan, Martin and Felsberg, Michael},
title = {{Discriminative Learning and Target Attention for the 2019 DAVIS Challenge on Video Object Segmentation}},
booktitle = {CVPR 2019 workshops},
year = {2019},
}
While recent years have witnessed astonishing improvements in visual tracking robustness, the advancements in tracking accuracy have been limited. As the focus has been directed towards the development of powerful classifiers, the problem of accurate target state estimation has been largely overlooked. In fact, most trackers resort to a simple multi-scale search in order to estimate the target bounding box. We argue that this approach is fundamentally limited since target estimation is a complex task, requiring high-level knowledge about the object. We address this problem by proposing a novel tracking architecture, consisting of dedicated target estimation and classification components. High-level knowledge is incorporated into the target estimation through extensive offline learning. Our target estimation component is trained to predict the overlap between the target object and an estimated bounding box. By carefully integrating target-specific information, our approach achieves previously unseen bounding box accuracy. We further introduce a classification component that is trained online to guarantee high discriminative power in the presence of distractors. Our final tracking framework sets a new state-of-the-art on five challenging benchmarks. On the new large-scale TrackingNet dataset, our tracker ATOM achieves a relative gain of 15% over the previous best approach, while running at over 30 FPS. Code and models are available at https://github.com/visionml/pytracking.
@inproceedings{diva2:1387537,
author = {Danelljan, Martin and Bhat, Goutam and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{ATOM: Accurate tracking by overlap maximization}},
booktitle = {2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019)},
year = {2019},
series = {Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)},
pages = {4655--4664},
publisher = {IEEE},
}
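The overlap-maximization idea can be sketched as gradient ascent of a predicted IoU with respect to the box coordinates. The IoU predictor below is an untrained stand-in (the real one is conditioned on target and image features), and the step size and iteration count are assumptions.

```python
# Hedged sketch of overlap-maximization target estimation: refine a candidate box
# by gradient ascent on a (stand-in) network that predicts its IoU with the target.
import torch
import torch.nn as nn

iou_predictor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

def refine_box(box_xywh, steps=5, lr=0.01):
    box = box_xywh.clone().requires_grad_(True)
    for _ in range(steps):
        predicted_iou = iou_predictor(box).sum()
        grad, = torch.autograd.grad(predicted_iou, box)
        with torch.no_grad():
            box += lr * grad          # gradient ascent on the predicted overlap
    return box.detach()

initial_box = torch.tensor([120.0, 80.0, 64.0, 128.0])   # (x, y, w, h)
print(refine_box(initial_box))
```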
The Visual Object Tracking challenge VOT2018 is the sixth annual tracker benchmarking activity organized by the VOT initiative. Results of over eighty trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in the recent years. The evaluation included the standard VOT and other popular methodologies for short-term tracking analysis and a “real-time” experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. A long-term tracking subchallenge has been introduced to the set of standard VOT sub-challenges. The new subchallenge focuses on long-term tracking properties, namely coping with target disappearance and reappearance. A new dataset has been compiled and a performance evaluation methodology that focuses on long-term tracking capabilities has been adopted. The VOT toolkit has been updated to support both standard short-term and the new long-term tracking subchallenges. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The dataset, the evaluation kit and the results are publicly available at the challenge website (http://votchallenge.net).
@inproceedings{diva2:1366619,
author = {Kristan, Matej and Leonardis, Ale\v{s} and Matas, Jirí and Felsberg, Michael and Pflugfelder, Roman and Zajc, Luka Cehovin and Vojírì, Tomá\v{s} and Bhat, Goutam and Lukezi\v{c}, Alan and Eldesokey, Abdelrahman and Fernández, Gustavo and García-Martín, Álvaro and Iglesias-Arias, Álvaro and Alatan, A. Aydin and González-García, Abel and Petrosino, Alfredo and Memarmoghadam, Alireza and Vedaldi, Andrea and Muhi\v{c}, Andrej and He, Anfeng and Smeulders, Arnold and Perera, Asanka G. and Li, Bo and Chen, Boyu and Kim, Changick and Xu, Changsheng and Xiong, Changzhen and Tian, Cheng and Luo, Chong and Sun, Chong and Hao, Cong and Kim, Daijin and Mishra, Deepak and Chen, Deming and Wang, Dong and Wee, Dongyoon and Gavves, Efstratios and Gundogdu, Erhan and Velasco-Salido, Erik and Khan, Fahad Shahbaz and Yang, Fan and Zhao, Fei and Li, Feng and Battistone, Francesco and De Ath, George and Subrahmanyam, Gorthi R. K. S. and Bastos, Guilherme and Ling, Haibin and Galoogahi, Hamed Kiani and Lee, Hankyeol and Li, Haojie and Zhao, Haojie and Fan, Heng and Zhang, Honggang and Possegger, Horst and Li, Houqiang and Lu, Huchuan and Zhi, Hui and Li, Huiyun and Lee, Hyemin and Chang, Hyung Jin and Drummond, Isabela and Valmadre, Jack and Martin, Jaime Spencer and Chahl, Javaan and Choi, Jin Young and Li, Jing and Wang, Jinqiao and Qi, Jinqing and Sung, Jinyoung and Johnander, Joakim and Henriques, Joao and Choi, Jongwon and van de Weijer, Joost and Herranz, Jorge Rodríguez and Martínez, Jos\'{e} M. and Kittler, Josef and Zhuang, Junfei and Gao, Junyu and Grm, Klemen and Zhang, Lichao and Wang, Lijun and Yang, Lingxiao and Rout, Litu and Si, Liu and Bertinetto, Luca and Chu, Lutao and Che, Manqiang and Maresca, Mario Edoardo and Danelljan, Martin and Yang, Ming-Hsuan and Abdelpakey, Mohamed and Shehata, Mohamed and Kang, Myunggu and Lee, Namhoon and Wang, Ning and Miksik, Ondrej and Moallem, P. and Vicente-Moñivar, Pablo and Senna, Pedro and Li, Peixia and Torr, Philip and Raju, Priya Mariam and Ruihe, Qian and Wang, Qiang and Zhou, Qin and Guo, Qing and Martín-Nieto, Rafael and Gorthi, Rama Krishna and Tao, Ran and Bowden, Richard and Everson, Richard and Wang, Runling and Yun, Sangdoo and Choi, Seokeon and Vivas, Sergio and Bai, Shuai and Huang, Shuangping and Wu, Sihang and Hadfield, Simon and Wang, Siwen and Golodetz, Stuart and Ming, Tang and Xu, Tianyang and Zhang, Tianzhu and Fischer, Tobias and Santopietro, Vincenzo and Štruc, Vitomir and Wei, Wang and Zuo, Wangmeng and Feng, Wei and Wu, Wei and Zou, Wei and Hu, Weiming and Zhou, Wengang and Zeng, Wenjun and Zhang, Xiaofan and Wu, Xiaohe and Wu, Xiao-Jun and Tian, Xinmei and Li, Yan and Lu, Yan and Law, Yee Wei and Wu, Yi and Demiris, Yiannis and Yang, Yicai and Jiao, Yifan and Li, Yuhong and Zhang, Yunhua and Sun, Yuxuan and Zhang, Zheng and Zhu, Zheng and Feng, Zhen-Hua and Wang, Zhihui and He, Zhiqun},
title = {{The Sixth Visual Object Tracking VOT2018 Challenge Results}},
booktitle = {Computer Vision -- ECCV 2018 Workshops},
year = {2019},
series = {Lecture Notes in Computer Science},
volume = {11129},
pages = {3--53},
publisher = {Springer Publishing Company},
address = {Cham},
}
Thermal Infrared (TIR) cameras are gaining popularity in many computer vision applications due to their ability to operate under low-light conditions. Images produced by TIR cameras are usually difficult for humans to perceive visually, which limits their usability. Several methods in the literature were proposed to address this problem by transforming TIR images into realistic visible spectrum (VIS) images. However, existing TIR-VIS datasets suffer from imperfect alignment between TIR-VIS image pairs, which degrades the performance of supervised methods. We tackle this problem by learning this transformation using an unsupervised Generative Adversarial Network (GAN) which trains on unpaired TIR and VIS images. When trained and evaluated on the KAIST-MS dataset, our proposed method was shown to produce significantly more realistic and sharp VIS images than the existing state-of-the-art supervised methods. In addition, our proposed method was shown to generalize very well when evaluated on a new dataset of new environments.
@inproceedings{diva2:1365425,
author = {Nyberg, Adam and Eldesokey, Abdelrahman and Bergström, David and Gustafsson, David},
title = {{Unpaired Thermal to Visible Spectrum Transfer using Adversarial Training}},
booktitle = {Computer Vision - Eccv 2018 Workshops, Pt VI},
year = {2019},
series = {Lecture Notes in Computer Science},
volume = {11134},
pages = {657--669},
publisher = {Springer},
}
Deep learning requires large amounts of annotated data. Manual annotation of objects in video is, regardless of annotation type, a tedious and time-consuming process. In particular, for scarcely used image modalities, human annotation is hard to justify. In such cases, semi-automatic annotation provides an acceptable option.
In this work, a recursive, semi-automatic annotation method for video is presented. The proposed method utilizes a state-of-the-art video object segmentation method to propose initial annotations for all frames in a video based on only a few manual object segmentations. In the case of a multi-modal dataset, the multi-modality is exploited to refine the proposed annotations even further. The final tentative annotations are presented to the user for manual correction.
The method is evaluated on a subset of the RGBT-234 visual-thermal dataset, reducing the workload for a human annotator by approximately 78% compared to full manual annotation. Utilizing the proposed pipeline, sequences are annotated for the VOT-RGBT 2019 challenge.
@inproceedings{diva2:1362582,
author = {Berg, Amanda and Johnander, Joakim and Durand de Gevigney, Flavie and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Semi-automatic Annotation of Objects in Visual-Thermal Video}},
booktitle = {2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)},
year = {2019},
series = {IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
One of the fundamental challenges in video object segmentation is to find an effective representation of the target and background appearance. The best performing approaches resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue, which is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close the gap to approaches based on online fine-tuning on DAVIS17, while operating at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
@inproceedings{diva2:1361997,
author = {Johnander, Joakim and Danelljan, Martin and Brissman, Emil and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{A generative appearance model for end-to-end video object segmentation}},
booktitle = {2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019},
series = {Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, IEEE Conference on Computer Vision and Pattern Recognition},
pages = {8945--8954},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
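As a toy illustration of a generative appearance module, the sketch below fits one Gaussian each to target and background features from the first frame and converts new-frame features into posterior target probabilities with Bayes' rule. The paper's module uses a richer, learnable model inside an end-to-end network, so this is only a conceptual stand-in with assumed feature dimensions and priors.

```python
# Hedged sketch: class-conditional Gaussians for target/background features and
# per-pixel posterior target probabilities. Dimensions and priors are assumptions.
import numpy as np

def fit_gaussian(features):
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6
    return mean, var

def log_likelihood(features, mean, var):
    return -0.5 * (((features - mean) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=1)

rng = np.random.default_rng(0)
fg = rng.normal(1.0, 0.5, size=(200, 16))       # first-frame target pixels (toy features)
bg = rng.normal(-1.0, 0.5, size=(800, 16))      # first-frame background pixels
fg_model, bg_model = fit_gaussian(fg), fit_gaussian(bg)

# Posterior P(target | feature) for new-frame features, used as a discriminative cue.
new = rng.normal(0.8, 0.5, size=(5, 16))
log_fg = log_likelihood(new, *fg_model) + np.log(0.2)   # prior from mask area
log_bg = log_likelihood(new, *bg_model) + np.log(0.8)
posterior = 1.0 / (1.0 + np.exp(log_bg - log_fg))
print(posterior)
```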
Trackers based on discriminative correlation filters (DCF) have recently seen widespread success and in this work we dive into their numerical core. DCF-based trackers interleave learning of the target detector and target state inference based on this detector. Whereas the original formulation includes a closed-form solution for the filter learning, recently introduced improvements to the framework no longer have known closed-form solutions. Instead a large-scale linear least squares problem must be solved each time the detector is updated. We analyze the procedure used to optimize the detector and let the popular scheme introduced with ECO serve as a baseline. The ECO implementation is revisited in detail and several mechanisms are provided with alternatives. With comprehensive experiments we show which configurations are superior in terms of tracking capabilities and optimization performance.
@inproceedings{diva2:1361993,
author = {Johnander, Joakim and Bhat, Goutam and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{On the Optimization of Advanced DCF-Trackers}},
booktitle = {Computer Vision -- ECCV 2018 Workshops},
year = {2019},
series = {Lecture Notes in Computer Science},
volume = {11129},
pages = {54--69},
publisher = {Springer Publishing Company},
address = {Cham},
}
We address short-term, single-object tracking, a topic that is currently seeing fast progress for visual video, for the case of thermal infrared (TIR) imagery. Tracking methods designed for TIR are often subject to a number of constraints, e.g., warm objects, low spatial resolution, and static camera. As TIR cameras become less noisy and get higher resolution these constraints are less relevant, and for emerging civilian applications, e.g., surveillance and automotive safety, new tracking methods are needed. Due to the special characteristics of TIR imagery, we argue that template-based trackers based on distribution fields should have an advantage over trackers based on spatial structure features. In this paper, we propose a template-based tracking method (ABCD) designed specifically for TIR and not being restricted by any of the constraints above. The proposed tracker is evaluated on the VOT-TIR2015 and VOT2015 datasets using the VOT evaluation toolkit and a comparison of relative ranking of all common participating trackers in the challenges is provided. Experimental results show that the ABCD tracker performs particularly well on thermal infrared sequences.
@inproceedings{diva2:1331282,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Visual Spectrum Image Generation from Thermal Infrared}},
booktitle = {Swedish Symposium on Image Analysis},
year = {2019},
}
Availability of large training datasets was essential for the recent advancement and success of deep learning methods. Due to the difficulties related to biometric data collection, datasets with age and gender annotations are scarce and usually limited in terms of size and sample diversity. Web-scraping approaches for automatic data collection can produce large amounts of weakly labeled, noisy data. The unsupervised facial biometric data filtering method presented in this paper greatly reduces label noise levels in web-scraped facial biometric data. Experiments on two large state-of-the-art web-scraped facial datasets demonstrate the effectiveness of the proposed method, with respect to training and validation scores, training convergence, and generalization capabilities of trained age and gender estimators.
@inproceedings{diva2:1292960,
author = {Be\v{s}eni\'{c}, Kre\v{s}imir and Ahlberg, Jörgen and Pandži\'{c}, Igor},
title = {{Unsupervised Facial Biometric Data Filtering for Age and Gender Estimation}},
booktitle = {Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP 2019)},
year = {2019},
pages = {209--217},
publisher = {SciTePress},
}
In most computer vision applications, convolutional neural networks (CNNs) operate on dense image data generated by ordinary cameras. Designing CNNs for sparse and irregularly spaced input data is still an open problem with numerous applications in autonomous driving, robotics, and surveillance. To tackle this challenging problem, we introduce an algebraically-constrained convolution layer for CNNs with sparse input and demonstrate its capabilities for the scene depth completion task. We propose novel strategies for determining the confidence from the convolution operation and propagating it to consecutive layers. Furthermore, we propose an objective function that simultaneously minimizes the data error while maximizing the output confidence. Comprehensive experiments are performed on the KITTI depth benchmark and the results clearly demonstrate that the proposed approach achieves superior performance while requiring three times fewer parameters than the state-of-the-art methods. Moreover, our approach produces a continuous pixel-wise confidence map enabling information fusion, state inference, and decision support.
@inproceedings{diva2:1233027,
author = {Eldesokey, Abdelrahman and Felsberg, Michael and Khan, Fahad Shahbaz},
title = {{Propagating Confidences through CNNs for Sparse Data Regression}},
booktitle = {British Machine Vision Conference 2018, BMVC 2018},
year = {2019},
publisher = {BMVA Press},
}
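A minimal sketch of a confidence-propagating (normalized) convolution is given below. The softplus non-negativity trick, the kernel size, and the particular output-confidence definition are simplifying assumptions rather than the exact published layer.

```python
# Hedged sketch of confidence-aware (normalized) convolution for sparse depth:
# convolve data jointly with its confidence map, renormalize, and pass an output
# confidence to the next layer. Details are assumptions, not the published design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedConv2d(nn.Module):
    def __init__(self, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1, kernel_size, kernel_size))
        self.pad = kernel_size // 2

    def forward(self, x, conf):
        w = F.softplus(self.weight)                     # keep filter weights non-negative
        num = F.conv2d(x * conf, w, padding=self.pad)   # confidence-weighted data term
        den = F.conv2d(conf, w, padding=self.pad)       # accumulated confidence
        out = num / (den + 1e-8)
        out_conf = den / w.sum()                        # propagate a normalized confidence
        return out, out_conf

sparse_depth = torch.zeros(1, 1, 8, 8)
conf = torch.zeros(1, 1, 8, 8)
sparse_depth[0, 0, 2, 3], conf[0, 0, 2, 3] = 5.0, 1.0   # a single valid measurement
dense, new_conf = NormalizedConv2d()(sparse_depth, conf)
print(dense[0, 0, 2, 3].item(), new_conf.max().item())
```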
We present Lambda Twist; a novel P3P solver which is accurate, fast and robust. Current state-of-the-art P3P solvers find all roots to a quartic and discard geometrically invalid and duplicate solutions in a post-processing step. Instead of solving a quartic, the proposed P3P solver exploits the underlying elliptic equations which can be solved by a fast and numerically accurate diagonalization. This diagonalization requires a single real root of a cubic which is then used to find the, up to four, P3P solutions. Unlike the direct quartic solvers our method never computes geometrically invalid or duplicate solutions.
Extensive evaluation on synthetic data shows that the new solver has better numerical accuracy and is faster compared to the state-of-the-art P3P implementations. Implementation and benchmark are available on GitHub.
@inproceedings{diva2:1365550,
author = {Persson, Mikael and Nordberg, Klas},
title = {{Lambda Twist: An Accurate Fast Robust Perspective Three Point (P3P) Solver}},
booktitle = {Computer Vision -- ECCV 2018},
year = {2018},
series = {Lecture Notes in Computer Science},
volume = {11208},
pages = {334--349},
publisher = {Springer},
address = {Cham},
}
In the field of generic object tracking numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features, and their relation to tracking accuracy and robustness. We identify the limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top performing tracker from the challenge with a relative gain of >17% in EAO.
@inproceedings{diva2:1361991,
author = {Bhat, Goutam and Johnander, Joakim and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Unveiling the power of deep tracking}},
booktitle = {Computer Vision -- ECCV 2018},
year = {2018},
series = {Lecture Notes in Computer Science},
volume = {11206},
pages = {493--509},
publisher = {Springer Publishing Company},
address = {Cham},
}
This paper investigates the problem of position estimation of unmanned surface vessels (USVs) operating in coastal areas or in the archipelago. We propose a position estimation method where the horizon line is extracted in a 360 degree panoramic image around the USV. We design a CNN architecture to determine an approximate horizon line in the image and implicitly determine the camera orientation (the pitch and roll angles). The panoramic image is warped to compensate for the camera orientation and to generate an image from an approximately level camera. A second CNN architecture is designed to extract the pixelwise horizon line in the warped image. The extracted horizon line is correlated with digital elevation model (DEM) data in the Fourier domain using a MOSSE correlation filter. Finally, we determine the location of the maximum correlation score over the search area to estimate the position of the USV. Comprehensive experiments are performed in a field trial in the archipelago. Our approach provides promising results by achieving position estimates with GPS-level accuracy.
@inproceedings{diva2:1361978,
author = {Grelsson, Bertil and Robinson, Andreas and Felsberg, Michael and Khan, Fahad Shahbaz},
title = {{HorizonNet for visual terrain navigation}},
booktitle = {Proceedings of 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS)},
year = {2018},
pages = {149--155},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
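The final localization step of the approach above can be illustrated with synthetic data: the extracted 360-degree horizon profile is correlated in the Fourier domain with DEM-predicted profiles at candidate positions, and the position with the strongest correlation peak wins. The data, the candidate grid, and the use of plain circular cross-correlation in place of a trained MOSSE filter are assumptions made only for this sketch.

```python
# Hedged sketch of horizon-to-DEM matching via Fourier-domain correlation.
import numpy as np

def circular_correlation(a, b):
    """Correlate two 360-sample horizon profiles over all cyclic shifts via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b))))

rng = np.random.default_rng(1)
observed = np.sin(np.linspace(0, 4 * np.pi, 360)) + 0.1 * rng.normal(size=360)

# Candidate positions in the search area, each with a DEM-predicted horizon profile.
candidates = {pos: rng.normal(size=360) for pos in [(0, 0), (0, 1), (1, 0)]}
candidates[(2, 3)] = np.sin(np.linspace(0, 4 * np.pi, 360))   # the correct location

scores = {pos: circular_correlation(observed, prof).max()
          for pos, prof in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 1))
```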
Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual tracking. However, these methods still struggle in occlusion and out-of-view scenarios due to the absence of a re-detection component. While such a component requires global knowledge of the scene to ensure robust re-detection of the target, the standard DCF is only trained on the local target neighborhood. In this paper, we augment the state-of-the-art DCF tracking framework with a re-detection component based on a global appearance model. First, we introduce a tracking confidence measure to detect target loss. Next, we propose a hard negative mining strategy to extract background distractor samples, used for training the global model. Finally, we propose a robust re-detection strategy that combines the global and local appearance model predictions. We perform comprehensive experiments on the challenging UAV123 and LTB35 datasets. Our approach shows consistent improvements over the baseline tracker, setting a new state-of-the-art on both datasets.
@inproceedings{diva2:1332807,
author = {Bhat, Goutam and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Combining Local and Global Models for Robust Re-detection}},
booktitle = {Proceedings of AVSS 2018. 2018 IEEE International Conference on Advanced Video and Signal-based Surveillance, Auckland, New Zealand, 27-30 November 2018},
year = {2018},
pages = {25--30},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Recognizing human attributes in unconstrained environments is a challenging computer vision problem. State-of-the-art approaches to human attribute recognition are based on convolutional neural networks (CNNs). The de facto practice when training these CNNs on a large labeled image dataset is to take RGB pixel values of an image as input to the network. In this work, we propose a two-stream part-based deep representation for human attribute classification. Besides the standard RGB stream, we train a deep network by using mapped coded images with explicit texture information, that complements the standard RGB deep model. To integrate human body parts knowledge, we employ the deformable part-based models together with our two-stream deep model. Experiments are performed on the challenging Human Attributes (HAT-27) Dataset consisting of 27 different human attributes. Our results clearly show that (a) the two-stream deep network provides consistent gain in performance over the standard RGB model and (b) that the attribute classification results are further improved with our two-stream part-based deep representations, leading to state-of-the-art results.
@inproceedings{diva2:1265191,
author = {Anwer, Rao Muhammad and Khan, Fahad and Laaksonen, Jorma},
title = {{Two-Stream Part-based Deep Representation for Human Attribute Recognition}},
booktitle = {2018 INTERNATIONAL CONFERENCE ON BIOMETRICS (ICB)},
year = {2018},
series = {International Conference on Biometrics},
pages = {90--97},
publisher = {IEEE},
}
Mobile robots have been used for various purposes with different functionalities which require them to freely move in environments containing both static and dynamic obstacles to accomplish given tasks. One of the most relevant capabilities in terms of navigating a mobile robot in such an environment is to find a safe path to a goal position. This paper shows that there exists an accurate solution to the Laplace equation which allows finding a collision-free path and that it can be efficiently calculated for a rectangular bounded domain such as a map which is represented as an image. This is accomplished by the use of the monogenic scale space resulting in a vector field which describes the attracting and repelling forces from the obstacles and the goal. The method is shown to work in reasonably convex domains and by the use of tessellation of the environment map for non-convex environments.
@inproceedings{diva2:1263728,
author = {Holmquist, Karl and Senel, Deniz and Felsberg, Michael},
title = {{Computing a Collision-Free Path using the monogenic scale space}},
booktitle = {2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2018},
series = {International Conference on Intelligent Robots and Systems (IROS)},
pages = {8097--8102},
publisher = {IEEE},
}
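For intuition, the sketch below solves the Laplace equation on a small occupancy grid by plain Jacobi relaxation (instead of the monogenic scale-space solution used in the paper) and extracts a collision-free path by steepest descent on the resulting harmonic potential. The grid size, obstacle layout, and iteration counts are assumptions.

```python
# Hedged sketch of harmonic potential-field path planning on a toy grid map.
import numpy as np

grid_size, goal = 20, (18, 18)
occupied = np.zeros((grid_size, grid_size), dtype=bool)
occupied[5:15, 10] = True                     # a wall-like obstacle

potential = np.ones((grid_size, grid_size))   # borders and obstacles repel (value 1)
for _ in range(5000):                         # Jacobi relaxation of the Laplace equation
    new = potential.copy()
    new[1:-1, 1:-1] = 0.25 * (potential[:-2, 1:-1] + potential[2:, 1:-1] +
                              potential[1:-1, :-2] + potential[1:-1, 2:])
    new[occupied] = 1.0
    new[goal] = 0.0                           # the goal attracts (lowest potential)
    potential = new

pos, path = (1, 1), [(1, 1)]
for _ in range(200):                          # steepest-descent path extraction
    r, c = pos
    neighbors = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                 if (dr, dc) != (0, 0) and 0 < r + dr < grid_size - 1
                 and 0 < c + dc < grid_size - 1]
    pos = min(neighbors, key=lambda p: potential[p])
    path.append(pos)
    if pos == goal:
        break
print(path[-1] == goal, len(path))
```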
Selective laser melting (SLM) enables production of highly intricate components. From this point of view, the capabilities of this technology are known to the industry and have been demonstrated in numerous applications. Nonetheless, for serial production purposes the manufacturing industry has so far been reluctant in substituting its conventional methods with SLM. One underlying reason is the lack of simple and reliable process monitoring methods. This study examines the feasibility of using thermography for process monitoring. To this end, an infra-red (IR) camera was mounted off-axis to monitor and record the temperature of every layer. The recorded temperature curves are analysed and interpreted with respect to different stages of the process. Furthermore, the possibility of detecting variations in laser settings by means of thermography is demonstrated. The results show that once thermal patterns are identified, this data can be utilized for in-process and post-process monitoring of SLM production.
@inproceedings{diva2:1261349,
author = {Hatami, Sepehr and Dahl-Jendelin, Anton and Ahlberg, Jörgen and Nelsson, Claes},
title = {{Selective Laser Melting Process Monitoring by Means of Thermography}},
booktitle = {Proceedings of Euro Powder Metallurgy Congress (Euro PM)},
year = {2018},
publisher = {European Powder Metallurgy Association (EPMA)},
}
Ren et al. [17] recently introduced a method for aggregating multiple decision trees into a strong predictor by interpreting a path taken by a sample down each tree as a binary vector and performing linear regression on top of these vectors stacked together. They provided experimental evidence that the method offers advantages over the usual approaches for combining decision trees (random forests and boosting). The method truly shines when the regression target is a large vector with correlated dimensions, such as a 2D face shape represented with the positions of several facial landmarks. However, we argue that their basic method is not applicable in many practical scenarios due to large memory requirements. This paper shows how this issue can be solved through the use of quantization and architectural changes of the predictor that maps decision tree-derived encodings to the desired output.
@inproceedings{diva2:1261236,
author = {Markus, Nenad and Gogic, Ivan and Pandžic, Igor and Ahlberg, Jörgen},
title = {{Memory-efficient Global Refinement of Decision-Tree Ensembles and its Application to Face Alignment}},
booktitle = {Proceedings of BMVC 2018 and Workshops},
year = {2018},
pages = {1--11},
publisher = {The British Machine Vision Association and Society for Pattern Recognition},
address = {Newcastle upon Tyne, UK},
}
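The core aggregation idea above can be sketched with scikit-learn: each sample is encoded by which leaf it reaches in every tree (carrying the same information as a binary path vector), the per-tree encodings are stacked into one long sparse binary vector, and a ridge regressor is fit on top. The toy data and the omission of the paper's quantization and memory-reduction machinery are assumptions.

```python
# Hedged sketch of global linear refinement on top of tree-ensemble encodings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 2.0 + np.sin(X[:, 1])                 # toy regression target

forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0).fit(X, y)
leaf_ids = forest.apply(X)                          # (n_samples, n_trees) leaf indices
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaf_ids)
binary_paths = encoder.transform(leaf_ids)          # stacked sparse binary encoding

refiner = Ridge(alpha=1.0).fit(binary_paths, y)     # global linear refinement
print(refiner.score(encoder.transform(forest.apply(X)), y))
```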
We analyze the depth reconstruction precision and sensitivity of two-frame triangulation for the case of general motion, and focus on the case of monocular visual odometry, that is: a single camera looking mostly in the direction of motion. The results confirm intuitive assumptions about the limited triangulation precision close to the focus of expansion.
@inproceedings{diva2:1259596,
author = {Fanani, Nolang and Mester, Rudolf},
title = {{The precision of triangulation in monocular visual odometry}},
booktitle = {2018 IEEE SOUTHWEST SYMPOSIUM ON IMAGE ANALYSIS AND INTERPRETATION (SSIAI)},
year = {2018},
series = {IEEE Southwest Symposium on Image Analysis and Interpretation},
pages = {73--76},
publisher = {IEEE},
}
The Exponential Linear Unit (ELU) has been proven to speed up learning and improve the classification performance over activation functions such as ReLU and Leaky ReLU for convolutional neural networks. The reasons behind the improved behavior are that ELU reduces the bias shift, it saturates for large negative inputs and it is continuously differentiable. However, it remains open whether ELU has the optimal shape and we address the quest for a superior activation function. We use a new formulation to tune a piecewise linear activation function during training, to investigate the above question, and learn the shape of the locally optimal activation function. With this tuned activation function, the classification performance is improved and the resulting learned activation function turns out to be ELU-shaped irrespective of whether it is initialized as a ReLU, LReLU or ELU. Interestingly, the learned activation function does not exactly pass through the origin, indicating that a shifted ELU-shaped activation function is preferable. This observation leads us to introduce the Shifted Exponential Linear Unit (ShELU) as a new activation function. Experiments on CIFAR-100 show that the classification performance is further improved when using the ShELU activation function in comparison with ELU. The improvement is achieved when learning an individual bias shift for each neuron.
@inproceedings{diva2:1251561,
author = {Grelsson, Bertil and Felsberg, Michael},
title = {{Improved Learning in Convolutional Neural Networks with Shifted Exponential Linear Units (ShELUs)}},
booktitle = {2018 24th International Conference on Pattern Recognition (ICPR)},
year = {2018},
series = {International Conference on Pattern Recognition},
pages = {517--522},
publisher = {IEEE},
}
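A minimal sketch of a shifted ELU is given below: a standard ELU preceded by a learned input shift, so the activation no longer has to pass through the origin. Implementing the shift per channel rather than per neuron, and the fixed alpha, are simplifications relative to what the paper describes.

```python
# Hedged sketch of a Shifted ELU: ELU applied to a learned, shifted input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShELU(nn.Module):
    def __init__(self, num_channels, alpha=1.0):
        super().__init__()
        # Learned bias shift; per-channel here, per-neuron in the paper.
        self.shift = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.alpha = alpha

    def forward(self, x):
        return F.elu(x + self.shift, alpha=self.alpha)

act = ShELU(num_channels=16)
x = torch.randn(4, 16, 8, 8)
print(act(x).shape)
```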
Recent years have witnessed a significant leap in visual object tracking performance mainly due to powerful features, sophisticated learning methods and the introduction of benchmark datasets. Despite this significant improvement, the evaluation of state-of-the-art object trackers still relies on the classical intersection over union (IoU) score. In this work, we argue that the object tracking evaluations based on classical IoU score are sub-optimal. As our first contribution, we theoretically prove that the IoU score is biased in the case of large target objects and favors over-estimated target prediction sizes. As our second contribution, we propose a new score that is unbiased with respect to target prediction size. We systematically evaluate our proposed approach on benchmark tracking data with variations in relative target size. Our empirical results clearly suggest that the proposed score is unbiased in general.
@inproceedings{diva2:1248643,
author = {Häger, Gustav and Felsberg, Michael and Khan, Fahad Shahbaz},
title = {{Countering bias in tracking evaluations}},
booktitle = {Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications},
year = {2018},
pages = {581--587},
publisher = {Science and Technology Publications, Lda},
}
Probabilistic methods for point set registration have demonstrated competitive results in recent years. These techniques estimate a probability distribution model of the point clouds. While such a representation has shown promise, it is highly sensitive to variations in the density of 3D points. This fundamental problem is primarily caused by changes in the sensor location across point sets. We revisit the foundations of the probabilistic registration paradigm. Contrary to previous works, we model the underlying structure of the scene as a latent probability distribution, and thereby induce invariance to point set density changes. Both the probabilistic model of the scene and the registration parameters are inferred by minimizing the Kullback-Leibler divergence in an Expectation Maximization based framework. Our density-adaptive registration successfully handles severe density variations commonly encountered in terrestrial Lidar applications. We perform extensive experiments on several challenging real-world Lidar datasets. The results demonstrate that our approach outperforms state-of-the-art probabilistic methods for multi-view registration, without the need of re-sampling.
@inproceedings{diva2:1233671,
author = {Järemo Lawin, Felix and Danelljan, Martin and Khan, Fahad Shahbaz and Forss\'{e}n, Per-Erik and Felsberg, Michael},
title = {{Density Adaptive Point Set Registration}},
booktitle = {2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2018},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
pages = {3829--3837},
publisher = {IEEE},
}
In this paper we derive and test a probability-based weighting that can balance residuals of different types in spline fitting. In contrast to previous formulations, the proposed spline error weighting scheme also incorporates a prediction of the approximation error of the spline fit. We demonstrate the effectiveness of the prediction in a synthetic experiment, and apply it to visual-inertial fusion on rolling shutter cameras. This results in a method that can estimate 3D structure with metric scale on generic first-person videos. We also propose a quality measure for spline fitting, that can be used to automatically select the knot spacing. Experiments verify that the obtained trajectory quality corresponds well with the requested quality. Finally, by linearly scaling the weights, we show that the proposed spline error weighting minimizes the estimation errors on real sequences, in terms of scale and end-point errors.
@inproceedings{diva2:1230190,
author = {Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik},
title = {{Spline Error Weighting for Robust Visual-Inertial Fusion}},
booktitle = {2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2018},
series = {Computer Vision and Pattern Recognition},
pages = {321--329},
}
Transformation of thermal infrared (TIR) images into visual, i.e. perceptually realistic color (RGB) images, is a challenging problem. TIR cameras have the ability to see in scenarios where vision is severely impaired, for example in total darkness or fog, and they are commonly used, e.g., for surveillance and automotive applications. However, interpretation of TIR images is difficult, especially for untrained operators. Enhancing the TIR image display by transforming it into a plausible, visual, perceptually realistic RGB image presumably facilitates interpretation. Existing grayscale-to-RGB, so-called colorization, methods cannot be applied to TIR images directly, since those methods only estimate the chrominance and not the luminance. In the absence of applicable colorization methods, we propose two fully automatic TIR to visual color image transformation methods, a two-step and an integrated approach, based on Convolutional Neural Networks. The methods require neither pre- nor postprocessing, do not require any user input, and are robust to image pair misalignments. We show that the methods do indeed produce perceptually realistic results on publicly available data, which is assessed both qualitatively and quantitatively.
@inproceedings{diva2:1229000,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Generating Visible Spectrum Images from Thermal Infrared}},
booktitle = {Proceedings 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops CVPRW 2018},
year = {2018},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops},
pages = {1224--1233},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
More and more devices have depth sensors, making RGB+D video (colour+depth video) increasingly common. RGB+D video allows the use of depth image based rendering (DIBR) to render a given scene from different viewpoints, thus making it a useful asset in view prediction for 3D and free-viewpoint video coding. In this paper we evaluate a multitude of algorithms for scattered data interpolation, in order to optimize the performance of DIBR for video coding. This also includes novel contributions like a Kriging refinement step, an edge suppression step to suppress artifacts, and a scale-adaptive kernel. Our evaluation uses the depth extension of the Sintel datasets. Using ground-truth sequences is crucial for such an optimization, as it ensures that all errors and artifacts are caused by the prediction itself rather than noisy or erroneous data. We also present a comparison with the commonly used mesh-based projection.
@inproceedings{diva2:1253223,
author = {Ogniewski, Jens and Forss\'{e}n, Per-Erik},
title = {{Pushing the Limits for View Prediction in Video Coding}},
booktitle = {PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISIGRAPP 2017), VOL 4},
year = {2017},
pages = {68--76},
publisher = {SCITEPRESS},
}
Research in optical flow estimation has to a large extent focused on achieving the best possible quality with no regard to running time. Nevertheless, in a number of important applications the speed is crucial. To address this problem we present BriefMatch, a real-time optical flow method that is suitable for live applications. The method combines binary features with the search strategy from PatchMatch in order to efficiently find a dense correspondence field between images. We show that the BRIEF descriptor provides better candidates (less outlier-prone) in shorter time, when compared to direct pixel comparisons and the Census transform. This allows us to achieve high quality results from a simple filtering of the initially matched candidates. Currently, BriefMatch has the fastest running time on the Middlebury benchmark, while ranking highest among all methods that run in less than 0.5 seconds.
@inproceedings{diva2:1228880,
author = {Eilertsen, Gabriel and Forss\'{e}n, Per-Erik and Unger, Jonas},
title = {{BriefMatch: Dense binary feature matching for real-time optical flow estimation}},
booktitle = {Proceedings of the Scandinavian Conference on Image Analysis (SCIA17)},
year = {2017},
series = {Lecture Notes in Computer Science},
pages = {221--233},
publisher = {Springer},
}
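A toy sketch of the binary-feature matching idea (random intensity comparisons ranked by Hamming distance) is given below; the PatchMatch-style propagation and the filtering that make up BriefMatch itself are omitted, and the 128-bit sampling pattern is purely illustrative.

```python
import numpy as np

def brief_descriptor(img, y, x, pairs):
    """Binary BRIEF-style descriptor: one bit per intensity comparison.

    img   : 2D grayscale array (ideally pre-smoothed)
    pairs : (K, 4) array of offsets (dy1, dx1, dy2, dx2) around the keypoint
    """
    bits = np.empty(len(pairs), dtype=np.uint8)
    for k, (dy1, dx1, dy2, dx2) in enumerate(pairs):
        bits[k] = img[y + dy1, x + dx1] < img[y + dy2, x + dx2]
    return bits

def hamming(d1, d2):
    """Hamming distance between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))

rng = np.random.default_rng(0)
pairs = rng.integers(-4, 5, size=(128, 4))         # hypothetical 128-bit pattern
img0 = rng.random((64, 64)).astype(np.float32)
img1 = np.roll(img0, shift=(0, 1), axis=(0, 1))    # toy "motion" of one pixel
d0 = brief_descriptor(img0, 32, 32, pairs)
d1 = brief_descriptor(img1, 32, 33, pairs)
print(hamming(d0, d1))                             # small distance for a true match
```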
The Visual Object Tracking challenge VOT2017 is the fifth annual tracker benchmarking activity organized by the VOT initiative. Results of 51 trackers are presented; many are state-of-the-art published at major computer vision conferences or journals in recent years. The evaluation included the standard VOT and other popular methodologies and a new "real-time" experiment simulating a situation where a tracker processes images as if provided by a continuously running sensor. Performance of the tested trackers typically by far exceeds standard baselines. The source code for most of the trackers is publicly available from the VOT page. The VOT2017 goes beyond its predecessors by (i) improving the VOT public dataset and introducing a separate VOT2017 sequestered dataset, (ii) introducing a real-time tracking experiment and (iii) releasing a redesigned toolkit that supports complex experiments. The dataset, the evaluation kit and the results are publicly available at the challenge website.
@inproceedings{diva2:1192158,
author = {Kristan, Matej and Leonardis, Ales and Matas, Jiri and Felsberg, Michael and Pflugfelder, Roman and Zajc, Luka Cehovin and Vojir, Tomas and Häger, Gustav and Lukezic, Alan and Eldesokey, Abdelrahman and Fernandez, Gustavo and Garcia-Martin, Alvaro and Muhic, A. and Petrosino, Alfredo and Memarmoghadam, Alireza and Vedaldi, Andrea and Manzanera, Antoine and Tran, Antoine and Alatan, Aydin and Mocanu, Bogdan and Chen, Boyu and Huang, Chang and Xu, Changsheng and Sun, Chong and Du, Dalong and Zhang, David and Du, Dawei and Mishra, Deepak and Gundogdu, Erhan and Velasco-Salido, Erik and Khan, Fahad and Battistone, Francesco and Subrahmanyam, Gorthi R. K. Sai and Bhat, Goutam and Huang, Guan and Bastos, Guilherme and Seetharaman, Guna and Zhang, Hongliang and Li, Houqiang and Lu, Huchuan and Drummond, Isabela and Valmadre, Jack and Jeong, Jae-Chan and Cho, Jae-Il and Lee, Jae-Yeong and Noskova, Jana and Zhu, Jianke and Gao, Jin and Liu, Jingyu and Kim, Ji-Wan and Henriques, Joao F. and Martinez, Jose M. and Zhuang, Junfei and Xing, Junliang and Gao, Junyu and Chen, Kai and Palaniappan, Kannappan and Lebeda, Karel and Gao, Ke and Kitani, Kris M. and Zhang, Lei and Wang, Lijun and Yang, Lingxiao and Wen, Longyin and Bertinetto, Luca and Poostchi, Mahdieh and Danelljan, Martin and Mueller, Matthias and Zhang, Mengdan and Yang, Ming-Hsuan and Xie, Nianhao and Wang, Ning and Miksik, Ondrej and Moallem, P. and Venugopal, Pallavi M. and Senna, Pedro and Torr, Philip H. S. and Wang, Qiang and Yu, Qifeng and Huang, Qingming and Martin-Nieto, Rafael and Bowden, Richard and Liu, Risheng and Tapu, Ruxandra and Hadfield, Simon and Lyu, Siwei and Golodetz, Stuart and Choi, Sunglok and Zhang, Tianzhu and Zaharia, Titus and Santopietro, Vincenzo and Zou, Wei and Hu, Weiming and Tao, Wenbing and Li, Wenbo and Zhou, Wengang and Yu, Xianguo and Bian, Xiao and Li, Yang and Xing, Yifan and Fan, Yingruo and Zhu, Zheng and Zhang, Zhipeng and He, Zhiqun},
title = {{The Visual Object Tracking VOT2017 challenge results}},
booktitle = {2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2017)},
year = {2017},
series = {IEEE International Conference on Computer Vision Workshops},
pages = {1949--1972},
publisher = {IEEE},
}
The recognition of individual object instances in single monocular images is still an incompletely solved task. In this work, we propose a new approach for detecting and separating vehicles in the context of autonomous driving. Our method uses a fully convolutional network (FCN) for semantic labeling and for estimating the boundary of each vehicle. Even though a contour is in general a one-pixel-wide structure that cannot be directly learned by a CNN, our network addresses this by predicting areas around the contours. Based on these areas, we separate the individual vehicle instances. In our experiments, we show on two challenging datasets (Cityscapes and KITTI) that we achieve state-of-the-art performance, despite using a subsampling rate of two. Our approach even outperforms all recent works w.r.t. several rating scores.
@inproceedings{diva2:1192153,
author = {van den Brand, Jan and Ochs, Matthias and Mester, Rudolf},
title = {{Instance-Level Segmentation of Vehicles by Deep Contours}},
booktitle = {COMPUTER VISION - ACCV 2016 WORKSHOPS, PT I},
year = {2017},
series = {Lecture Notes in Computer Science},
pages = {477--492},
publisher = {SPRINGER INTERNATIONAL PUBLISHING AG},
}
Monocular visual odometry / SLAM requires the ability to deal with the scale ambiguity problem, or equivalently to transform the estimated unscaled poses into correctly scaled poses. While propagating the scale from frame to frame is possible, it is very prone to the scale drift effect. We address the problem of monocular scale estimation by proposing a multimodal mechanism of prediction, classification, and correction. Our scale correction scheme combines cues from both dense and sparse ground plane estimation; this makes the proposed method robust to varying availability and distribution of trackable ground structure. Instead of optimizing the parameters of the ground plane related homography, we parametrize and optimize the underlying motion parameters directly. Furthermore, we employ classifiers to detect scale outliers based on various features (e.g. moments on residuals). We test our method on the challenging KITTI dataset and show that the proposed method is capable of providing scale estimates that are on par with current state-of-the-art monocular methods without using bundle adjustment or RANSAC.
@inproceedings{diva2:1192092,
author = {Fanani, Nolang and Stuerck, Alina and Barnada, Marc and Mester, Rudolf},
title = {{Multimodal Scale Estimation for Monocular Visual Odometry}},
booktitle = {2017 28TH IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV 2017)},
year = {2017},
series = {IEEE Intelligent Vehicles Symposium},
pages = {1714--1721},
publisher = {IEEE},
}
Most iterative optimization algorithms for motion, depth estimation or scene reconstruction, both sparse and dense, rely on a coarse and reliable dense initialization to bootstrap their optimization procedure. This makes it important to have techniques that can obtain a dense, albeit approximate, representation of a desired 2D structure (e.g., depth maps, optical flow, disparity maps) from a very sparse measurement of this structure. The method presented here exploits the complete information given by the principal component analysis (PCA): the principal basis and its prior distribution. The method is able to determine a dense reconstruction even if only a very sparse measurement is available. When facing such situations, the number of principal components is typically reduced further, which results in a loss of expressiveness of the basis. We overcome this problem and inject prior knowledge in a maximum a posteriori (MAP) approach. We test our approach on the KITTI and the Virtual KITTI datasets and focus on the interpolation of depth maps for driving scenes. The evaluation shows good agreement with the ground truth and results clearly superior to interpolation by the nearest-neighbor method, which disregards statistical information.
@inproceedings{diva2:1192091,
author = {Ochs, Matthias and Bradler, Henry and Mester, Rudolf},
title = {{Learning Rank Reduced Interpolation with Principal Component Analysis}},
booktitle = {2017 28TH IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV 2017)},
year = {2017},
series = {IEEE Intelligent Vehicles Symposium},
pages = {1126--1133},
publisher = {IEEE},
}
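The core idea, MAP estimation of PCA coefficients from sparse measurements under a Gaussian prior, can be sketched in a few lines of numpy; the random basis, prior variances and noise level below are placeholders and not the model learned in the paper.

```python
import numpy as np

def map_reconstruct(mu, B, lam, idx, y, noise_var=1e-2):
    """MAP reconstruction of a dense map from sparse samples.

    mu  : (D,)   mean of the training maps (flattened)
    B   : (D, K) principal basis (columns = principal components)
    lam : (K,)   prior variances of the PCA coefficients (eigenvalues)
    idx : (M,)   indices of the sparse measurements
    y   : (M,)   measured values at those indices
    """
    Bs = B[idx]                      # rows of the basis at the measured positions
    r = y - mu[idx]                  # residual w.r.t. the mean map
    A = Bs.T @ Bs / noise_var + np.diag(1.0 / lam)
    c = np.linalg.solve(A, Bs.T @ r / noise_var)
    return mu + B @ c                # dense MAP estimate

# toy usage with a random basis; in practice mu, B, lam come from PCA on training depth maps
rng = np.random.default_rng(0)
D, K, M = 1000, 20, 50
B = np.linalg.qr(rng.standard_normal((D, K)))[0]
mu, lam = rng.standard_normal(D), np.linspace(5.0, 0.5, K)
truth = mu + B @ (rng.standard_normal(K) * np.sqrt(lam))
idx = rng.choice(D, size=M, replace=False)
dense = map_reconstruct(mu, B, lam, idx, truth[idx])
print(np.abs(dense - truth).mean())
```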
Semantic segmentation of 3D point clouds is a challenging problem with numerous real-world applications. While deep learning has revolutionized the field of image semantic segmentation, its impact on point cloud data has been limited so far. Recent attempts, based on 3D deep learning approaches (3D-CNNs), have achieved below-expected results. Such methods require voxelizations of the underlying point cloud data, leading to decreased spatial resolution and increased memory consumption. Additionally, 3D-CNNs greatly suffer from the limited availability of annotated datasets.
@inproceedings{diva2:1185653,
author = {Järemo-Lawin, Felix and Danelljan, Martin and Tosteberg, Patrik and Bhat, Goutam and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Deep Projective 3D Semantic Segmentation}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2017},
series = {Lecture Notes in Computer Science},
volume = {10424},
pages = {95--107},
publisher = {Springer},
}
Discriminative Correlation Filter (DCF) based methods have shown competitive performance on tracking benchmarks in recent years. Generally, DCF based trackers learn a rigid appearance model of the target. However, this reliance on a single rigid appearance model is insufficient in situations where the target undergoes non-rigid transformations. In this paper, we propose a unified formulation for learning a deformable convolution filter. In our framework, the deformable filter is represented as a linear combination of sub-filters. Both the sub-filter coefficients and their relative locations are inferred jointly in our formulation. Experiments are performed on three challenging tracking benchmarks: OTB-2015, TempleColor and VOT2016. Our approach improves the baseline method, leading to performance comparable to state-of-the-art.
@inproceedings{diva2:1185623,
author = {Johnander, Joakim and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{DCCO: Towards Deformable Continuous Convolution Operators for Visual Tracking}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2017},
series = {Lecture Notes in Computer Science},
volume = {10424},
pages = {55--67},
publisher = {Springer},
}
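A minimal sketch of how a detection score map could be assembled from weighted, shifted sub-filter responses is shown below; the joint inference of coefficients and sub-filter locations, which is the paper's actual contribution, is not reproduced, and the shift convention is illustrative.

```python
import numpy as np

def deformable_response(feat, sub_filters, offsets, weights):
    """Detection score map from a filter built as a weighted sum of shifted sub-filters.

    feat        : (H, W) single-channel feature map
    sub_filters : list of (H, W) filters (zero-padded to the feature size)
    offsets     : list of (dy, dx) sub-filter locations relative to the target centre
    weights     : list of scalar sub-filter coefficients
    """
    F = np.fft.fft2(feat)
    score = np.zeros(feat.shape)
    for h, (dy, dx), w in zip(sub_filters, offsets, weights):
        resp = np.real(np.fft.ifft2(F * np.conj(np.fft.fft2(h))))   # circular correlation
        score += w * np.roll(resp, shift=(-dy, -dx), axis=(0, 1))   # move part response to centre
    return score
```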
Autonomous driving safety is becoming a paramount issue due to the emergence of many autonomous vehicle prototypes. The safety measures ensure that autonomous vehicles are safe to operate among pedestrians, cyclists and conventional vehicles. While safety measures for pedestrians have been widely studied in literature, little attention has been paid to safety measures for cyclists. Visual cyclists analysis is a challenging problem due to the complex structure and dynamic nature of the cyclists. The dynamic model used for cyclists analysis heavily relies on the wheels. In this paper, we investigate the problem of ellipse detection for visual cyclists analysis in the wild. Our first contribution is the introduction of a new challenging annotated dataset for bicycle wheels, collected in real-world urban environment. Our second contribution is a method that combines reliable arcs selection and grouping strategies for ellipse detection. The reliable selection and grouping mechanism leads to robust ellipse detections when combined with the standard least square ellipse fitting approach. Our experiments clearly demonstrate that our method provides improved results, both in terms of accuracy and robustness in challenging urban environment settings.
@inproceedings{diva2:1185617,
author = {Eldesokey, Abdelrahman and Felsberg, Michael and Khan, Fahad Shahbaz},
title = {{Ellipse Detection for Visual Cyclists Analysis ``In the Wild''}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2017},
series = {Lecture Notes in Computer Science},
volume = {10424},
pages = {319--331},
publisher = {Springer},
}
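Only the standard least-squares fitting step mentioned in the abstract is illustrated below (an algebraic conic fit); the arc selection and grouping strategies that constitute the paper's contribution are not shown.

```python
import numpy as np

def fit_conic_lsq(x, y):
    """Algebraic least-squares conic fit a*x^2 + b*xy + c*y^2 + d*x + e*y + f = 0.

    Returns the conic coefficients with unit norm (smallest singular vector).
    A constrained variant (e.g. Fitzgibbon et al.) is needed to guarantee an ellipse.
    """
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(D)
    return vt[-1]

# toy usage: noisy points on an axis-aligned ellipse with semi-axes 3 and 2
t = np.linspace(0, 2 * np.pi, 200)
x = 3.0 * np.cos(t) + 0.01 * np.random.randn(t.size)
y = 2.0 * np.sin(t) + 0.01 * np.random.randn(t.size)
a, b, c, d, e, f = fit_conic_lsq(x, y)
print(b * b - 4 * a * c < 0)   # discriminant < 0 confirms an ellipse
```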
An increasing number of robots and autonomous vehicles are equipped with multiple cameras to achieve surround-view sensing. The estimation of their relative poses, also known as extrinsic parameter calibration, is a challenging problem, particularly in the non-overlapping case. We present a simple and novel extrinsic calibration method based on standard components that compares favorably to existing approaches. We further propose a framework for predicting the performance of different calibration configurations, together with intuitive error metrics. This makes selecting a good camera configuration straightforward. We evaluate on rendered synthetic images and show good results as measured by angular and absolute pose differences, as well as the reprojection error distributions.
@inproceedings{diva2:1185614,
author = {Robinson, Andreas and Persson, Mikael and Felsberg, Michael},
title = {{Robust Accurate Extrinsic Calibration of Static Non-overlapping Cameras}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2017},
series = {Lecture Notes in Computer Science},
volume = {10425},
pages = {342--353},
publisher = {Springer},
}
In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with a massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model; (ii) a compact generative model of the training sample distribution, that significantly reduces memory and time complexity, while providing better diversity of samples; (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and Temple-Color. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.0% relative gain in Expected Average Overlap compared to the top ranked method [12] in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 65.0% AUC on OTB-2015.
@inproceedings{diva2:1173604,
author = {Danelljan, Martin and Bhat, Goutam and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{ECO: Efficient Convolution Operators for Tracking}},
booktitle = {Proceedings 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
volume = {2017},
pages = {6931--6939},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
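As a rough sketch of contribution (i), the factorized convolution operator can be thought of as a learned channel projection followed by correlation with a small set of filters; the numpy code below only illustrates this scoring step under assumed shapes, not the training, the sample model, or the update strategy.

```python
import numpy as np

def factorized_response(x, P, filters):
    """Detection scores with a factorized convolution operator.

    x       : (D, H, W) feature map with D channels
    P       : (D, C)    learned projection matrix, with C much smaller than D
    filters : (C, H, W) correlation filters in the compressed feature space
    """
    z = np.tensordot(P.T, x, axes=1)            # (C, H, W) projected features
    score = np.zeros(x.shape[1:])
    for zc, fc in zip(z, filters):
        Zc, Fc = np.fft.fft2(zc), np.fft.fft2(fc)
        score += np.real(np.fft.ifft2(Zc * np.conj(Fc)))  # circular correlation per channel
    return score
```

Instead of one filter per feature channel, only C filters plus a D-by-C projection are learned, which is where the parameter reduction comes from.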
Many of the latest smart phones and tablets come with integrated depth sensors, that make depth-maps freely available, thus enabling new forms of applications like rendering from different view points. However, efficient compression exploiting the characteristics of depth-maps as well as the requirements of these new applications is still an open issue. In this paper, we evaluate different depth-map compression algorithms, with a focus on tree-based methods and view projection as application.
The contributions of this paper are the following: 1. extensions of existing geometric compression trees, 2. a comparison of a number of different trees, 3. a comparison of them to a state-of-the-art video coder, 4. an evaluation using ground-truth data that considers both depth-maps and predicted frames with arbitrary camera translation and rotation.
Despite our best efforts, and contrary to earlier results, current video depth-map compression outperforms tree-based methods in most cases. The reason for this is likely that previous evaluations focused on low-quality, low-resolution depth maps, while high-resolution depth (as needed in the DIBR setting) has been ignored up until now. We also demonstrate that PSNR on depth-maps is not always a good measure of their utility.
@inproceedings{diva2:1150797,
author = {Ogniewski, Jens and Forss\'{e}n, Per-Erik},
title = {{What is the best depth-map compression for Depth Image Based Rendering?}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2017},
series = {Lecture Notes in Computer Science},
volume = {10425},
pages = {403--415},
publisher = {Springer},
}
The ability to direct visual attention is a fundamental skill for seeing robots. Attention comes in two flavours: the gaze direction (overt attention) and attention to a specific part of the current field of view (covert attention), of which the latter is the focus of the present study. Specifically, we study the effects of attentional masking within pre-trained deep neural networks for the purpose of handling ambiguous scenes containing multiple objects. We investigate several variants of attentional masking on partially pre-trained deep neural networks and evaluate the effects on classification performance and sensitivity to attention mask errors in multi-object scenes. We find that a combined scheme consisting of multi-level masking and blending provides the best trade-off between classification accuracy and insensitivity to masking errors. This proposed approach is denoted multilayer continuous-valued convolutional feature masking (MC-CFM). For reasonably accurate masks it can suppress the influence of distracting objects and reach comparable classification performance to unmasked recognition in cases without distractors.
@inproceedings{diva2:1150792,
author = {Wallenberg, Marcus and Forssen, Per-Erik},
title = {{Attentional Masking for Pre-trained Deep Networks}},
booktitle = {Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS17)},
year = {2017},
pages = {6149--6154},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
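The core masking operation, soft-masking an intermediate activation map with blending, might look as follows under an assumed PyTorch tensor layout; the multi-level scheme and the specific layers used in MC-CFM are not captured by this sketch.

```python
import torch

def mask_and_blend(feat, mask, alpha=0.5):
    """Continuous-valued masking of a convolutional feature map with blending.

    feat  : (N, C, H, W) activations at some intermediate layer
    mask  : (N, 1, H, W) attention mask in [0, 1], resized to the layer's resolution
    alpha : blending factor; alpha=1 is hard masking, alpha=0 leaves features untouched
    """
    return feat * (alpha * mask + (1.0 - alpha))
```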
Traditionally, pose estimation is considered as a two-step problem. First, feature correspondences are determined by direct comparison of image patches, or by associating feature descriptors. In a second step, the relative pose and the coordinates of corresponding points are estimated, most often by minimizing the reprojection error (RPE). RPE optimization is based on a loss function that is merely aware of the feature pixel positions but not of the underlying image intensities. In this paper, we propose a sparse direct method which introduces a loss function that allows us to simultaneously optimize the unscaled relative pose, as well as the set of feature correspondences, directly considering the image intensity values. Furthermore, we show how to integrate statistical prior information on the motion into the optimization process. This constructive inclusion of a Bayesian bias term is particularly efficient in application cases with a strongly predictable (short term) dynamic, e.g. in a driving scenario. In our experiments, we demonstrate that the JET algorithm we propose outperforms the classical reprojection error optimization on two synthetic datasets and on the KITTI dataset. The JET algorithm runs in real-time on a single CPU thread.
@inproceedings{diva2:1129770,
author = {Bradler, Henry and Ochs, Matthias and Fanani, Nolang and Mester, Rudolf},
title = {{Joint Epipolar Tracking (JET): Simultaneous optimization of epipolar geometry and feature correspondences}},
booktitle = {2017 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2017)},
year = {2017},
series = {IEEE Winter Conference on Applications of Computer Vision},
pages = {445--453},
publisher = {IEEE},
}
Hyperspectral remote sensing based on unmanned airborne vehicles is a field increasing in importance. The combined functionality of simultaneous hyperspectral and geometric modeling is less developed. A configuration has been developed that enables the reconstruction of the hyperspectral three-dimensional (3D) environment. The hyperspectral camera is based on a linear variable filter and a high frame rate, high resolution camera enabling point-to-point matching and 3D reconstruction. This allows the information to be combined into a single and complete 3D hyperspectral model. In this paper, we describe the camera and illustrate capabilities and difficulties through real-world experiments.
@inproceedings{diva2:1107480,
author = {Ahlberg, Jörgen and Renhorn, Ingmar and Chevalier, Tomas and Rydell, Joakim and Bergström, David},
title = {{Three-dimensional hyperspectral imaging technique}},
booktitle = {ALGORITHMS AND TECHNOLOGIES FOR MULTISPECTRAL, HYPERSPECTRAL, AND ULTRASPECTRAL IMAGERY XXIII},
year = {2017},
series = {Proceedings of SPIE},
volume = {10198},
publisher = {SPIE - International Society for Optical Engineering},
}
We address short-term, single-object tracking, a topic that is currently seeing fast progress for visual video, for the case of thermal infrared (TIR) imagery. Tracking methods designed for TIR are often subject to a number of constraints, e.g., warm objects, low spatial resolution, and static camera. As TIR cameras become less noisy and get higher resolution these constraints are less relevant, and for emerging civilian applications, e.g., surveillance and automotive safety, new tracking methods are needed. Due to the special characteristics of TIR imagery, we argue that template-based trackers based on distribution fields should have an advantage over trackers based on spatial structure features. In this paper, we propose a templatebased tracking method (ABCD) designed specifically for TIR and not being restricted by any of the constraints above. The proposed tracker is evaluated on the VOT-TIR2015 and VOT2015 datasets using the VOT evaluation toolkit and a comparison of relative ranking of all common participating trackers in the challenges is provided. Experimental results show that the ABCD tracker performs particularly well on thermal infrared sequences.
@inproceedings{diva2:1090347,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Object Tracking in Thermal Infrared Imagery based on Channel Coded Distribution Fields}},
booktitle = {Swedish Symposium on Image Analysis},
year = {2017},
publisher = {Svenska sällskapet för automatiserad bildanalys (SSBA)},
}
This paper presents a study on a family of local hexagonal and multi-scale operators useful for texture analysis. The hexagonal grid shows an attractive rotation symmetry with uniform neighbour distances. The operator depicts a closed connected curve (1D periodic). It is resized within a scale interval during the conversion from the original square grid to the virtual hexagonal grid. Complementary image features, together with their tangential first-order hexagonal derivatives, are calculated. The magnitude/phase information from the Fourier or Fractional Fourier Transform (FFT, FrFT) is accumulated in thirty different Cartesian (polar for visualisation) and multi-scale domains. Simultaneous phase-correlation of a subset of the data gives an estimate of scaling/rotation relative to the references. Similarity metrics are used for template matching. The sample, unseen by the system, is classified into the group with the maximum fuzzy rank order. An instantiation of a 12-point hexagonal operator (radius=2) is first successfully evaluated on a set of thirteen Brodatz images (no scaling/rotation). Then it is evaluated on the more challenging KTH-TIPS2b texture dataset (scaling/rotation, varying pose/illumination). A confusion matrix and cumulative fuzzy rank order summaries show, for example, that the correct class is top-ranked 44 - 50% and top-three ranked 68 - 76% of all sample images. A similar evaluation, using a box-like 12-point mask on the square grid, gives overall lower accuracies. Finally, the FrFT parameter serves as an additional tuning parameter that influences the accuracies significantly.
@inproceedings{diva2:1147263,
author = {Brandtberg, Tomas},
title = {{Virtual hexagonal and multi-scale operator for fuzzy rank order texture classification using one-dimensional generalised Fourier analysis}},
booktitle = {2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2016},
series = {International Conference on Pattern Recognition},
pages = {2018--2024},
publisher = {IEEE COMPUTER SOC},
}
Current best local descriptors are learned on a large dataset of matching and non-matching keypoint pairs. However, data of this kind is not always available since detailed keypoint correspondences can be hard to establish. On the other hand, we can often obtain labels for pairs of keypoint bags. For example, keypoint bags extracted from two images of the same object under different views form a matching pair, and keypoint bags extracted from images of different objects form a non-matching pair. On average, matching pairs should contain more corresponding keypoints than non-matching pairs. We describe an end-to-end differentiable architecture that enables the learning of local keypoint descriptors from such weakly-labeled data.
@inproceedings{diva2:1147261,
author = {Markus, Nenad and Pandzic, Igor S. and Ahlberg, Jörgen},
title = {{Learning Local Descriptors by Optimizing the Keypoint-Correspondence Criterion}},
booktitle = {2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2016},
series = {International Conference on Pattern Recognition},
pages = {2380--2385},
publisher = {IEEE COMPUTER SOC},
}
Tracking-by-detection methods have demonstrated competitive performance in recent years. In these approaches, the tracking model heavily relies on the quality of the training set. Due to the limited amount of labeled training data, additional samples need to be extracted and labeled by the tracker itself. This often leads to the inclusion of corrupted training samples, due to occlusions, misalignments and other perturbations. Existing tracking-by-detection methods either ignore this problem, or employ a separate component for managing the training set. We propose a novel generic approach for alleviating the problem of corrupted training samples in tracking-by-detection frameworks. Our approach dynamically manages the training set by estimating the quality of the samples. Contrary to existing approaches, we propose a unified formulation by minimizing a single loss over both the target appearance model and the sample quality weights. The joint formulation enables corrupted samples to be down-weighted while increasing the impact of correct ones. Experiments are performed on three benchmarks: OTB-2015 with 100 videos, VOT-2015 with 60 videos, and Temple-Color with 128 videos. On the OTB-2015, our unified formulation significantly improves the baseline, with a gain of 3.8% in mean overlap precision. Finally, our method achieves state-of-the-art results on all three datasets.
@inproceedings{diva2:1104732,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking}},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
volume = {2016},
pages = {1430--1438},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
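A heavily simplified caricature of the idea is sketched below: alternating between a weighted ridge model and per-sample weights that down-weight large residuals. Note that the weight update here is a plain softmax over negative residuals, not the single joint loss optimized in the paper, and all parameter names are illustrative.

```python
import numpy as np

def retrain_with_sample_weights(X, y, reg=1.0, temp=1.0, iters=3):
    """Alternate between a weighted ridge model and per-sample quality weights.

    X : (K, D) one feature vector per training sample
    y : (K,)   desired responses
    """
    K = len(y)
    alpha = np.full(K, 1.0 / K)                       # start with uniform sample weights
    for _ in range(iters):
        # model step: weighted ridge regression
        A = (X * alpha[:, None]).T @ X + reg * np.eye(X.shape[1])
        w = np.linalg.solve(A, (X * alpha[:, None]).T @ y)
        # weight step: down-weight samples with large residuals (likely corrupted samples)
        err = (X @ w - y) ** 2
        alpha = np.exp(-err / temp)
        alpha /= alpha.sum()
    return w, alpha
```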
In recent years, sensors capable of measuring both color and depth information have become increasingly popular. Despite the abundance of colored point set data, state-of-the-art probabilistic registration techniques ignore the available color information. In this paper, we propose a probabilistic point set registration framework that exploits available color information associated with the points. Our method is based on a model of the joint distribution of 3D-point observations and their color information. The proposed model captures discriminative color information, while being computationally efficient. We derive an EM algorithm for jointly estimating the model parameters and the relative transformations. Comprehensive experiments are performed on the Stanford Lounge dataset, captured by an RGB-D camera, and two point sets captured by a Lidar sensor. Our results demonstrate a significant gain in robustness and accuracy when incorporating color information. On the Stanford Lounge dataset, our approach achieves a relative reduction of the failure rate by 78% compared to the baseline. Furthermore, our proposed model outperforms standard strategies for combining color and 3D-point information, leading to state-of-the-art results.
@inproceedings{diva2:1104730,
author = {Danelljan, Martin and Meneghetti, Giulia and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{A Probabilistic Framework for Color-Based Point Set Registration}},
booktitle = {2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2016},
series = {IEEE Conference on Computer Vision and Pattern Recognition},
volume = {2016},
pages = {1818--1826},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Visual object tracking performance has improved significantly in recent years. Most trackers are based on either of two paradigms: online learning of an appearance model or the use of a pre-trained object detector. Methods based on online learning provide high accuracy, but are prone to model drift. Model drift occurs when the tracker fails to correctly estimate the tracked object's position. Methods based on a detector, on the other hand, typically have good long-term robustness, but reduced accuracy compared to online methods.
Despite the complementarity of the aforementioned approaches, the problem of fusing them into a single framework is largely unexplored. In this paper, we propose a novel fusion between an online tracker and a pre-trained detector for tracking humans from a UAV. The system operates in real time on a UAV platform. In addition, we present a novel dataset for long-term tracking in a UAV setting, which includes scenarios that are typically not well represented in standard visual tracking datasets.
@inproceedings{diva2:1104310,
author = {Häger, Gustav and Bhat, Goutam and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael and Rudol, Piotr and Doherty, Patrick},
title = {{Combining Visual Tracking and Person Detection for Long Term Tracking on a UAV}},
booktitle = {Proceedings of the 12th International Symposium on Advances in Visual Computing},
year = {2016},
series = {Lecture Notes in Computer Science},
publisher = {Springer},
}
Robust visual tracking is a challenging computer vision problem, with many real-world applications. Most existing approaches employ hand-crafted appearance features, such as HOG or Color Names. Recently, deep RGB features extracted from convolutional neural networks have been successfully applied for tracking. Despite their success, these features only capture appearance information. On the other hand, motion cues provide discriminative and complementary information that can improve tracking performance. Contrary to visual tracking, deep motion features have been successfully applied for action recognition and video classification tasks. Typically, the motion features are learned by training a CNN on optical flow images extracted from large amounts of labeled videos. This paper presents an investigation of the impact of deep motion features in a tracking-by-detection framework. We further show that hand-crafted, deep RGB, and deep motion features contain complementary information. To the best of our knowledge, we are the first to propose fusing appearance information with deep motion features for visual tracking. Comprehensive experiments clearly suggest that our fusion approach with deep motion features outperforms standard methods relying on appearance information alone.
@inproceedings{diva2:1104308,
author = {Gladh, Susanna and Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Deep motion features for visual tracking}},
booktitle = {Proceedings of the 23rd International Conference on, Pattern Recognition (ICPR), 2016},
year = {2016},
pages = {1243--1248},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
3D-point set registration is an active area of research in computer vision. In recent years, probabilistic registration approaches have demonstrated superior performance for many challenging applications. Generally, these probabilistic approaches rely on the spatial distribution of the 3D-points, and only recently has color information been integrated into such a framework, significantly improving registration accuracy. Beyond local color information, high-dimensional 3D shape features have been successfully employed in many applications such as action recognition and 3D object recognition. In this paper, we propose a probabilistic framework to integrate high-dimensional 3D shape features with color information for point set registration. The 3D shape features are distinctive and provide complementary information beneficial for robust registration. We validate our proposed framework by performing comprehensive experiments on the challenging Stanford Lounge dataset, acquired by an RGB-D sensor, and an outdoor dataset captured by a Lidar sensor. The results clearly demonstrate that our approach provides superior results both in terms of robustness and accuracy compared to state-of-the-art probabilistic methods.
@inproceedings{diva2:1104306,
author = {Danelljan, Martin and Meneghetti, Giulia and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Aligning the Dissimilar: A Probabilistic Feature-Based Point Set Registration Approach}},
booktitle = {Proceedings of the 23rd International Conference on Pattern Recognition (ICPR) 2016},
year = {2016},
pages = {247--252},
publisher = {IEEE},
}
We address short-term, single-object tracking, a topic that is currently seeing fast progress for visual video, for the case of thermal infrared (TIR) imagery. The fast progress has been possible thanks to the development of new template-based tracking methods with online template updates, methods which have not been explored for TIR tracking. Instead, tracking methods used for TIR are often subject to a number of constraints, e.g., warm objects, low spatial resolution, and static camera. As TIR cameras become less noisy and get higher resolution these constraints are less relevant, and for emerging civilian applications, e.g., surveillance and automotive safety, new tracking methods are needed. Due to the special characteristics of TIR imagery, we argue that template-based trackers based on distribution fields should have an advantage over trackers based on spatial structure features. In this paper, we propose a template-based tracking method (ABCD) designed specifically for TIR and not being restricted by any of the constraints above. In order to avoid background contamination of the object template, we propose to exploit background information for the online template update and to adaptively select the object region used for tracking. Moreover, we propose a novel method for estimating object scale change. The proposed tracker is evaluated on the VOT-TIR2015 and VOT2015 datasets using the VOT evaluation toolkit and a comparison of relative ranking of all common participating trackers in the challenges is provided. Further, the proposed tracker, ABCD, and the VOT-TIR2015 winner SRDCFir are evaluated on maritime data. Experimental results show that the ABCD tracker performs particularly well on thermal infrared sequences.
@inproceedings{diva2:1072885,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Channel Coded Distribution Field Tracking for Thermal Infrared Imagery}},
booktitle = {PROCEEDINGS OF 29TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, (CVPRW 2016)},
year = {2016},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops},
pages = {1248--1256},
publisher = {IEEE},
}
Random Forests (RF) is a learning technique with very low run-time complexity. It has found a niche application in situations where input data is low-dimensional and computational performance is paramount. We wish to make RFs more useful for high dimensional problems, and to this end, we propose two extensions to RFs: firstly, a feature selection mechanism called correlation-enhancing projections, and secondly, sparse discriminant selection schemes for better accuracy and faster training. We evaluate the proposed extensions by performing age and gender estimation on the MORPH-II dataset, and demonstrate near-equal or improved estimation performance when using these extensions despite a seventy-fold reduction in the number of data dimensions.
@inproceedings{diva2:1068782,
author = {Wallenberg, Marcus and Forss\'{e}n, Per-Erik},
title = {{Improving Random Forests by Correlation-Enhancing Projections and Sample-Based Sparse Discriminant Selection}},
booktitle = {Proceedings 13th Conference on Computer and Robot Vision CRV 2016},
year = {2016},
pages = {222--227},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
One of the major steps in visual environment perception for automotive applications is to track keypoints and to subsequently estimate egomotion and environment structure from the trajectories of these keypoints. This paper presents a propagation based tracking method to obtain the 2D trajectories of keypoints from a sequence of images in a monocular camera setup. Instead of relying on the classical RANSAC to obtain accurate keypoint correspondences, we steer the search for keypoint matches by means of propagating the estimated 3D position of the keypoint into the next frame and verifying the photometric consistency. In this process, we continuously predict, estimate and refine the frame-to-frame relative pose which induces the epipolar relation. Experiments on the KITTI dataset as well as on the synthetic COnGRATS dataset show promising results on the estimated courses and accurate keypoint trajectories.
@inproceedings{diva2:1067522,
author = {Fanani, Nolang and Ochs, Matthias and Bradler, Henry and Mester, Rudolf},
title = {{Keypoint Trajectory Estimation Using Propagation Based Tracking}},
booktitle = {2016 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV)},
year = {2016},
series = {IEEE Intelligent Vehicles Symposium},
pages = {933--939},
publisher = {IEEE},
}
We present a framework that supports the development and evaluation of vision algorithms in the context of driver assistance applications and traffic surveillance. This framework allows the creation of highly realistic image sequences featuring traffic scenarios. The sequences are created with a realistic state of the art vehicle physics model; different kinds of environments are featured, thus providing a wide range of testing scenarios. Due to the physically-based rendering technique and variable camera models employed for the image rendering process, we can simulate different sensor setups and provide appropriate and fully accurate ground truth data.
@inproceedings{diva2:1067521,
author = {Biedermann, Daniel and Ochs, Matthias and Mester, Rudolf},
title = {{Evaluating visual ADAS components on the COnGRATS dataset}},
booktitle = {2016 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV)},
year = {2016},
series = {IEEE Intelligent Vehicles Symposium},
pages = {986--991},
publisher = {IEEE},
}
Correspondence relations between different views of the same scene can be learnt in an unsupervised manner. We address autonomous learning of arbitrary fixed spatial (point-to-point) mappings. Since any such transformation can be represented by a permutation matrix, the signal model is a linear one, whereas the proposed analysis method, mainly based on Canonical Correlation Analysis (CCA), relies on a generalized eigensystem problem, i.e., a nonlinear operation. The learnt transformation is represented implicitly in terms of pairs of learned basis vectors and neither uses nor requires an analytic/parametric expression for the latent mapping. We show how the rank of the signal that is shared among views may be determined from canonical correlations and how the overlapping (=shared) dimensions among the views may be inferred.
@inproceedings{diva2:1067517,
author = {Conrad, Christian and Mester, Rudolf},
title = {{LEARNING RANK REDUCED MAPPINGS USING CANONICAL CORRELATION ANALYSIS}},
booktitle = {2016 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP)},
year = {2016},
publisher = {IEEE},
}
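A plain numpy sketch of standard CCA via whitening and SVD is given below; the paper instead solves the corresponding generalized eigensystem and additionally infers the shared rank, which the toy example only hints at through correlations close to one.

```python
import numpy as np

def cca(X, Y, eps=1e-9):
    """Canonical correlation analysis between two views of the same signal.

    X : (N, p) samples from view 1 (rows are observations)
    Y : (N, q) samples from view 2
    Returns canonical correlations and the paired basis vectors for each view.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0] - 1
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return s, Wx @ U, Wy @ Vt.T        # correlations near 1 reveal the shared subspace

# toy usage: view 2 is a fixed permutation (point-to-point mapping) of view 1
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
Y = X[:, rng.permutation(6)] + 0.05 * rng.standard_normal((500, 6))
corr, A, B = cca(X, Y)
print(np.round(corr, 3))               # all close to 1: the shared rank is 6
```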
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers having been published at major computer vision conferences and journals in recent years. The number of tested state-of-the-art trackers makes VOT2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment.
@inproceedings{diva2:1063965,
author = {Kristan, Matej and Leonardis, Ales and Matas, Jiri and Felsberg, Michael and Pflugfelder, Roman and Cehovin, Luka and Vojir, Tomas and Häger, Gustav and Lukezic, Alan and Fernandez, Gustavo and Gupta, Abhinav and Petrosino, Alfredo and Memarmoghadam, Alireza and Garcia-Martin, Alvaro and Solis Montero, Andres and Vedaldi, Andrea and Robinson, Andreas and Ma, Andy J. and Varfolomieiev, Anton and Alatan, Aydin and Erdem, Aykut and Ghanem, Bernard and Liu, Bin and Han, Bohyung and Martinez, Brais and Chang, Chang-Ming and Xu, Changsheng and Sun, Chong and Kim, Daijin and Chen, Dapeng and Du, Dawei and Mishra, Deepak and Yeung, Dit-Yan and Gundogdu, Erhan and Erdem, Erkut and Khan, Fahad and Porikli, Fatih and Zhao, Fei and Bunyak, Filiz and Battistone, Francesco and Zhu, Gao and Roffo, Giorgio and Sai Subrahmanyam, Gorthi R. K. and Bastos, Guilherme and Seetharaman, Guna and Medeiros, Henry and Li, Hongdong and Qi, Honggang and Bischof, Horst and Possegger, Horst and Lu, Huchuan and Lee, Hyemin and Nam, Hyeonseob and Jin Chang, Hyung and Drummond, Isabela and Valmadre, Jack and Jeong, Jae-chan and Cho, Jae-il and Lee, Jae-Yeong and Zhu, Jianke and Feng, Jiayi and Gao, Jin and Young Choi, Jin and Xiao, Jingjing and Kim, Ji-Wan and Jeong, Jiyeoup and Henriques, Joao F. and Lang, Jochen and Choi, Jongwon and Martinez, Jose M. and Xing, Junliang and Gao, Junyu and Palaniappan, Kannappan and Lebeda, Karel and Gao, Ke and Mikolajczyk, Krystian and Qin, Lei and Wang, Lijun and Wen, Longyin and Bertinetto, Luca and Kumar Rapuru, Madan and Poostchi, Mahdieh and Maresca, Mario and Danelljan, Martin and Mueller, Matthias and Zhang, Mengdan and Arens, Michael and Valstar, Michel and Tang, Ming and Baek, Mooyeol and Haris Khan, Muhammad and Wang, Naiyan and Fan, Nana and Al-Shakarji, Noor and Miksik, Ondrej and Akin, Osman and Moallem, Payman and Senna, Pedro and Torr, Philip H. S. and Yuen, Pong C. and Huang, Qingming and Martin-Nieto, Rafael and Pelapur, Rengarajan and Bowden, Richard and Laganiere, Robert and Stolkin, Rustam and Walsh, Ryan and Krah, Sebastian B. and Li, Shengkun and Zhang, Shengping and Yao, Shizeng and Hadfield, Simon and Melzi, Simone and Lyu, Siwei and Li, Siyi and Becker, Stefan and Golodetz, Stuart and Kakanuru, Sumithra and Choi, Sunglok and Hu, Tao and Mauthner, Thomas and Zhang, Tianzhu and Pridmore, Tony and Santopietro, Vincenzo and Hu, Weiming and Li, Wenbo and Huebner, Wolfgang and Lan, Xiangyuan and Wang, Xiaomeng and Li, Xin and Li, Yang and Demiris, Yiannis and Wang, Yifan and Qi, Yuankai and Yuan, Zejian and Cai, Zexiong and Xu, Zhan and He, Zhenyu and Chi, Zhizhen},
title = {{The Visual Object Tracking VOT2016 Challenge Results}},
booktitle = {COMPUTER VISION - ECCV 2016 WORKSHOPS, PT II},
year = {2016},
series = {Lecture Notes in Computer Science},
volume = {9914},
pages = {777--823},
publisher = {SPRINGER INT PUBLISHING AG},
}
The Thermal Infrared Visual Object Tracking challenge 2016, VOT-TIR2016, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance. VOT-TIR2016 is the second benchmark on short-term tracking in TIR sequences. Results of 24 trackers are presented. For each participating tracker, a short description is provided in the appendix. The VOT-TIR2016 challenge is similar to the 2015 challenge, the main difference is the introduction of new, more difficult sequences into the dataset. Furthermore, VOT-TIR2016 evaluation adopted the improvements regarding overlap calculation in VOT2016. Compared to VOT-TIR2015, a significant general improvement of results has been observed, which partly compensate for the more difficult sequences. The dataset, the evaluation kit, as well as the results are publicly available at the challenge website.
@inproceedings{diva2:1063949,
author = {Felsberg, Michael and Kristan, Matej and Matas, Jiri and Leonardis, Ales and Pflugfelder, Roman and Häger, Gustav and Berg, Amanda and Eldesokey, Abdelrahman and Ahlberg, Jörgen and Cehovin, Luka and Vojir, Tomas and Lukezic, Alan and Fernandez, Gustavo and Petrosino, Alfredo and Garcia-Martin, Alvaro and Solis Montero, Andres and Varfolomieiev, Anton and Erdem, Aykut and Han, Bohyung and Chang, Chang-Ming and Du, Dawei and Erdem, Erkut and Khan, Fahad Shahbaz and Porikli, Fatih and Zhao, Fei and Bunyak, Filiz and Battistone, Francesco and Zhu, Gao and Seetharaman, Guna and Li, Hongdong and Qi, Honggang and Bischof, Horst and Possegger, Horst and Nam, Hyeonseob and Valmadre, Jack and Zhu, Jianke and Feng, Jiayi and Lang, Jochen and Martinez, Jose M. and Palaniappan, Kannappan and Lebeda, Karel and Gao, Ke and Mikolajczyk, Krystian and Wen, Longyin and Bertinetto, Luca and Poostchi, Mahdieh and Maresca, Mario and Danelljan, Martin and Arens, Michael and Tang, Ming and Baek, Mooyeol and Fan, Nana and Al-Shakarji, Noor and Miksik, Ondrej and Akin, Osman and Torr, Philip H. S. and Huang, Qingming and Martin-Nieto, Rafael and Pelapur, Rengarajan and Bowden, Richard and Laganiere, Robert and Krah, Sebastian B. and Li, Shengkun and Yao, Shizeng and Hadfield, Simon and Lyu, Siwei and Becker, Stefan and Golodetz, Stuart and Hu, Tao and Mauthner, Thomas and Santopietro, Vincenzo and Li, Wenbo and Huebner, Wolfgang and Li, Xin and Li, Yang and Xu, Zhan and He, Zhenyu},
title = {{The Thermal Infrared Visual Object Tracking VOT-TIR2016 Challenge Results}},
booktitle = {Computer Vision -- ECCV 2016 Workshops. ECCV 2016.},
year = {2016},
series = {Lecture Notes in Computer Science},
volume = {9914},
pages = {824--849},
publisher = {SPRINGER INT PUBLISHING AG},
}
In this paper we introduce an efficient method to unwrap multi-frequency phase estimates for time-of-flight ranging. The algorithm generates multiple depth hypotheses and uses a spatial kernel density estimate (KDE) to rank them. The confidence produced by the KDE is also an effective means to detect outliers. We also introduce a new closed-form expression for phase noise prediction, that better fits real data. The method is applied to depth decoding for the Kinect v2 sensor, and compared to the Microsoft Kinect SDK and to the open source driver libfreenect2. The intended Kinect v2 use case is scenes with less than 8m range, and for such cases we observe consistent improvements, while maintaining real-time performance. When extending the depth range to the maximal value of 18.75 m, we get about 52% more valid measurements than libfreenect2. The effect is that the sensor can now be used in large depth scenes, where it was previously not a good choice.
@inproceedings{diva2:1060849,
author = {Järemo-Lawin, Felix and Forss\'{e}n, Per-Erik and Ovr\'{e}n, Hannes},
title = {{Efficient Multi-frequency Phase Unwrapping Using Kernel Density Estimation}},
booktitle = {Computer Vision -- ECCV 2016 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part IV},
year = {2016},
series = {Lecture Notes in Computer Science},
volume = {9908},
pages = {170--185},
publisher = {Springer},
}
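A per-pixel toy version of the hypothesis-generation and KDE-ranking idea is sketched below; the spatial aggregation of the KDE and the noise model of the paper are omitted, and the modulation frequencies are illustrative values.

```python
import numpy as np

C = 299792458.0  # speed of light [m/s]

def unwrap_depth(phases, freqs, d_max=18.75, sigma=0.05):
    """Pick a depth from several wrapped phase measurements by kernel density ranking.

    phases : wrapped phases in [0, 2*pi), one per modulation frequency
    freqs  : modulation frequencies [Hz]
    Each frequency yields a comb of depth hypotheses; the hypothesis with the
    highest Gaussian KDE over all hypotheses is returned together with its score,
    which can double as an outlier confidence.
    """
    hyps = []
    for phi, f in zip(phases, freqs):
        half_wavelength = C / (2.0 * f)                       # unambiguous depth range
        n = np.arange(int(np.ceil(d_max / half_wavelength)) + 1)
        d = (phi / (2.0 * np.pi) + n) * half_wavelength
        hyps.append(d[d <= d_max])
    all_h = np.concatenate(hyps)
    # Gaussian KDE evaluated at every hypothesis
    scores = np.exp(-0.5 * ((all_h[:, None] - all_h[None, :]) / sigma) ** 2).sum(axis=1)
    best = np.argmax(scores)
    return all_h[best], scores[best]

# toy usage with three hypothetical modulation frequencies and a true depth of 7.3 m
freqs = np.array([80e6, 16e6, 120e6])
true_d = 7.3
phases = (2 * np.pi * true_d / (C / (2 * freqs))) % (2 * np.pi)
print(unwrap_depth(phases, freqs))     # hypotheses from all frequencies agree near 7.3 m
```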
Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments.
@inproceedings{diva2:1060848,
author = {Danelljan, Martin and Robinson, Andreas and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking}},
booktitle = {Computer Vision -- ECCV 2016},
year = {2016},
series = {Lecture Notes in Computer Science},
volume = {9909},
pages = {472--488},
publisher = {Springer},
address = {Cham},
}
Automatic analysis of visual art, such as paintings, is a challenging inter-disciplinary research problem. Conventional approaches only rely on global scene characteristics by encoding holistic information for computational painting categorization. We argue that such approaches are sub-optimal and that discriminative common visual structures provide complementary information for painting classification. We present an approach that encodes both the global scene layout and discriminative latent common structures for computational painting categorization. The regions of interest are automatically extracted, without any manual part labeling, by training class-specific deformable part-based models. Both the holistic image and the regions of interest are then described using multi-scale dense convolutional features. These features are pooled separately using Fisher vector encoding and concatenated afterwards into a single image representation. Experiments are performed on a challenging dataset with 91 different painters and 13 diverse painting styles. Our approach outperforms the standard method, which only employs the global scene characteristics. Furthermore, our method achieves state-of-the-art results, outperforming a recent multi-scale deep features based approach [11] by 6.4% and 3.8% on artist and style classification, respectively.
@inproceedings{diva2:1054664,
author = {Muhammad Anwer, Rao and Khan, Fahad and van de Weijer, Joost and Laaksonen, Jorma},
title = {{Combining Holistic and Part-based Deep Representations for Computational Painting Categorization}},
booktitle = {ICMR16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL},
year = {2016},
pages = {339--342},
publisher = {ASSOC COMPUTING MACHINERY},
}
Tracking keypoints through a sequence of images is one of the first major steps of structure-from-motion approaches. This paper presents a robust method to extend from two-frame keypoint matching to multi-frame tracking. The transition from matching to tracking is implemented through the propagation of the predicted 3D position of the keypoint. The uncertainty level of the tracking results is calculated based on the uncertainty of the motion parameters. Experiments on the KITTI dataset as well as on a synthetic dataset show that dense and accurate tracking is attainable.
@inproceedings{diva2:971879,
author = {Fanani, Nolang and Mester, Rudolf},
title = {{Propagation based tracking with uncertainty measurement in automotive applications}},
booktitle = {2016 IEEE SOUTHWEST SYMPOSIUM ON IMAGE ANALYSIS AND INTERPRETATION (SSIAI)},
year = {2016},
series = {IEEE Southwest Symposium on Image Analysis and Interpretation},
pages = {117--120},
publisher = {IEEE},
}
Recent years have shown great progress in driving assistance systems, approaching autonomous driving step by step. Many approaches, however, rely on lane markers, which limits the system to larger paved roads and poses problems during winter. In this work we explore an alternative approach to visual road following based on online learning. The system learns the current visual appearance of the road while the vehicle is operated by a human. When driving onto a new type of road, the human driver will drive for a minute while the system learns. After training, the human driver can let go of the controls. The present work proposes a novel approach to online perception-action learning for the specific problem of road following, which interchangeably makes use of supervised learning (by demonstration), instantaneous reinforcement learning, and unsupervised learning (self-reinforcement learning). The proposed method, symbiotic online learning of associations and regression (SOLAR), extends previous work on qHebb-learning in three ways: priors are introduced to enforce mode selection and to drive learning towards particular goals, the qHebb-learning method is complemented with a reinforcement variant, and a self-assessment method based on predictive coding is proposed. The SOLAR algorithm is compared to qHebb-learning and deep learning for the task of road following, implemented on a model RC-car. The system demonstrates an ability to learn to follow paved and gravel roads outdoors. Further, the system is evaluated in a controlled indoor environment which provides quantifiable results. The experiments show that the SOLAR algorithm results in autonomous capabilities that go beyond those of existing methods with respect to speed, accuracy, and functionality.
@inproceedings{diva2:947322,
author = {Öfjäll, Kristoffer and Felsberg, Michael and Robinson, Andreas},
title = {{Visual Autonomous Road Following by Symbiotic Online Learning}},
booktitle = {Intelligent Vehicles Symposium (IV), 2016 IEEE},
year = {2016},
pages = {136--143},
}
Phase correlation is one of the classic methods for sparse motion or displacement estimation. It is renowned in the literature for high precision and insensitivity against illumination variations. We propose several important enhancements to the phase correlation (PhC) method which render it more robust against those situations where a motion measurement is not possible (low structure, too much noise, too different image content in the corresponding measurement windows). This allows the method to perform self-diagnosis in adverse situations. Furthermore, we extend the PhC method by a robust scheme for detecting and classifying the presence of multiple motions and estimating their uncertainties. Experimental results on the Middlebury Stereo Dataset and on the KITTI Optical Flow Dataset show the potential offered by the enhanced method in contrast to the PhC implementation of OpenCV.
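For reference, a plain NumPy sketch of the classical phase correlation the paper builds on: the normalized cross-power spectrum plus a simple peak-ratio confidence. The enhancements described above (self-diagnosis in adverse situations, multiple motion distributions and their uncertainties) are not reproduced here:

import numpy as np

def phase_correlation(window_a, window_b, eps=1e-9):
    # Normalized cross-power spectrum between two equally sized image windows.
    A = np.fft.fft2(window_a.astype(float))
    B = np.fft.fft2(window_b.astype(float))
    cross = A * np.conj(B)
    corr = np.fft.fftshift(np.fft.ifft2(cross / (np.abs(cross) + eps)).real)
    # Displacement: location of the correlation peak relative to the window centre.
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    shift = np.array(peak) - np.array(corr.shape) // 2
    # Crude confidence: ratio between the highest and second-highest correlation value.
    top_two = np.sort(corr.ravel())[-2:]
    confidence = top_two[1] / (abs(top_two[0]) + eps)
    return shift, confidence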
@inproceedings{diva2:927608,
author = {Ochs, Matthias and Bradler, Henry and Mester, Rudolf},
title = {{Enhanced Phase Correlation for Reliable and Robust Estimation of Multiple Motion Distributions}},
booktitle = {IMAGE AND VIDEO TECHNOLOGY, PSIVT 2015},
year = {2016},
series = {Lecture Notes in Computer Science},
pages = {368--379},
publisher = {Springer Publishing Company},
}
The Thermal Infrared Visual Object Tracking (VOT-TIR2015) Challenge was organized in conjunction with ICCV2015. It was the first benchmark on short-term, single-target tracking in thermal infrared (TIR) sequences. The challenge aimed at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. It was based on the VOT2013 Challenge, but introduced the following novelties: (i) the utilization of the LTIR (Linköping TIR) dataset, (ii) adaptation of the VOT2013 attributes to thermal data, (iii) a similar evaluation to that of VOT2015. This paper provides an overview of the VOT-TIR2015 Challenge as well as the results of the 24 participating trackers.
@inproceedings{diva2:925830,
author = {Berg, Amanda and Felsberg, Michael and Häger, Gustav and Ahlberg, Jörgen},
title = {{An Overview of the Thermal Infrared Visual Object Tracking VOT-TIR2015 Challenge}},
booktitle = {Swedish Symposium on Image Analysis},
year = {2016},
series = {Svenska sällskapet för automatiserad bildanalys (SSBA)},
}
The Visual Object Tracking challenge 2015, VOT2015, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 62 trackers are presented. The number of tested trackers makes VOT 2015 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2015 challenge that go beyond its VOT2014 predecessor are: (i) a new VOT2015 dataset twice as large as in VOT2014 with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2014 evaluation methodology by introduction of a new performance measure. The dataset, the evaluation kit as well as the results are publicly available at the challenge website(1).
@inproceedings{diva2:1078694,
author = {Kristan, Matej and Matas, Jiri and Leonardis, Ales and Felsberg, Michael and Cehovin, Luka and Fernandez, Gustavo and Vojir, Tomas and Häger, Gustav and Nebehay, Georg and Pflugfelder, Roman and Gupta, Abhinav and Bibi, Adel and Lukezic, Alan and Garcia-Martins, Alvaro and Saffari, Amir and Petrosino, Alfredo and Solis Montero, Andres and Varfolomieiev, Anton and Baskurt, Atilla and Zhao, Baojun and Ghanem, Bernard and Martinez, Brais and Lee, ByeongJu and Han, Bohyung and Wang, Chaohui and Garcia, Christophe and Zhang, Chunyuan and Schmid, Cordelia and Tao, Dacheng and Kim, Daijin and Huang, Dafei and Prokhorov, Danil and Du, Dawei and Yeung, Dit-Yan and Ribeiro, Eraldo and Khan, Fahad and Porikli, Fatih and Bunyak, Filiz and Zhu, Gao and Seetharaman, Guna and Kieritz, Hilke and Tuen Yau, Hing and Li, Hongdong and Qi, Honggang and Bischof, Horst and Possegger, Horst and Lee, Hyemin and Nam, Hyeonseob and Bogun, Ivan and Jeong, Jae-chan and Cho, Jae-il and Lee, Jae-Young and Zhu, Jianke and Shi, Jianping and Li, Jiatong and Jia, Jiaya and Feng, Jiayi and Gao, Jin and Young Choi, Jin and Kim, Ji-Wan and Lang, Jochen and Martinez, Jose M. and Choi, Jongwon and Xing, Junliang and Xue, Kai and Palaniappan, Kannappan and Lebeda, Karel and Alahari, Karteek and Gao, Ke and Yun, Kimin and Hong Wong, Kin and Luo, Lei and Ma, Liang and Ke, Lipeng and Wen, Longyin and Bertinetto, Luca and Pootschi, Mandieh and Maresca, Mario and Danelljan, Martin and Wen, Mei and Zhang, Mengdan and Arens, Michael and Valstar, Michel and Tang, Ming and Chang, Ming-Ching and Haris Khan, Muhammad and Fan, Nana and Wang, Naiyan and Miksik, Ondrej and Torr, Philip H. S. and Wang, Qiang and Martin-Nieto, Rafael and Pelapur, Rengarajan and Bowden, Richard and Laganiere, Robert and Moujtahid, Salma and Hare, Sam and Hadfield, Simon and Lyu, Siwei and Li, Siyi and Zhu, Song-Chun and Becker, Stefan and Duffner, Stefan and Hicks, Stephen L. and Golodetz, Stuart and Choi, Sunglok and Wu, Tianfu and Mauthner, Thomas and Pridmore, Tony and Hu, Weiming and Hubner, Wolfgang and Wang, Xiaomeng and Li, Xin and Shi, Xinchu and Zhao, Xu and Mei, Xue and Shizeng, Yao and Hua, Yang and Li, Yang and Lu, Yang and Li, Yuezun and Chen, Zhaoyun and Huang, Zehua and Chen, Zhe and Zhang, Zhe and He, Zhenyu and Hong, Zhibin},
title = {{The Visual Object Tracking VOT2015 challenge results}},
booktitle = {Proceedings 2015 IEEE International Conference on Computer Vision Workshops ICCVW 2015},
year = {2015},
pages = {564--586},
publisher = {IEEE},
}
We present an approach to learn relative photometric differences between pairs of cameras, which have partially overlapping fields of view. This is an important problem, especially for appearance-based correspondence estimation or object identification in multi-camera systems, where grey values observed by different cameras are processed. We model intensity differences between pairs of cameras by means of a low-order polynomial (Gray Value Transfer Function, GVTF) which represents the characteristic curve mapping the grey values s_i produced by camera C_i to the corresponding grey values s_j acquired with camera C_j. While the estimation of the GVTF parameters is straightforward once a set of truly corresponding pairs of grey values is available, the non-trivial task in the GVTF estimation process solved in this paper is the extraction of corresponding grey value pairs in the presence of geometric and photometric errors. We also present a temporal GVTF update scheme to adapt to gradual global illumination changes, e.g., due to the change of daylight.
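A minimal sketch of fitting such a low-order GVTF by repeated least squares with simple outlier rejection, once candidate grey-value pairs are available; the robust extraction of those pairs, which is the actual contribution above, is not shown, and the polynomial order and inlier threshold are illustrative:

import numpy as np

def fit_gvtf(grey_i, grey_j, order=2, n_iter=5, inlier_thresh=10.0):
    # Polynomial mapping from grey values of camera C_i to grey values of camera C_j.
    s = np.asarray(grey_i, dtype=float)
    d = np.asarray(grey_j, dtype=float)
    keep = np.ones(s.shape, dtype=bool)
    for _ in range(n_iter):
        coeffs = np.polyfit(s[keep], d[keep], order)              # least-squares fit
        keep = np.abs(np.polyval(coeffs, s) - d) < inlier_thresh  # drop gross outliers
    return coeffs

# Usage: mapped = np.polyval(fit_gvtf(pairs_i, pairs_j), grey_values_from_camera_i)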
@inproceedings{diva2:1054654,
author = {Conrad, Christian and Mester, Rudolf},
title = {{Learning Relative Photometric Differences of Pairs of Cameras}},
booktitle = {2015 12TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE (AVSS)},
year = {2015},
publisher = {IEEE},
}
The online estimation of yaw, pitch, and roll of a moving vehicle is an important ingredient for systems which estimate egomotion and the 3D structure of the environment from video acquired in a moving vehicle. We present an approach to estimate these angular changes from monocular visual data, based on the fact that the motion of far-distant points does not depend on translation, but only on the current rotation of the camera. The presented approach does not require features (corners, edges, ...) to be extracted. It also estimates the frame-to-frame illumination changes in parallel, which largely stabilizes the estimation of image correspondences and motion vectors, the central entities needed for computing scene structure, distances, etc. The method is significantly less complex and much faster than a full egomotion computation from features, such as PTAM [6], but it can be used to provide motion priors and reduce search spaces for more complex methods which perform a complete analysis of egomotion and dynamic 3D structure of the scene in which a vehicle moves.
@inproceedings{diva2:971713,
author = {Barnada, Marc and Conrad, Christian and Bradler, Henry and Ochs, Matthias and Mester, Rudolf},
title = {{Estimation of Automotive Pitch, Yaw, and Roll using Enhanced Phase Correlation on Multiple Far-field Windows}},
booktitle = {2015 IEEE Intelligent Vehicles Symposium (IV)},
year = {2015},
pages = {481--486},
publisher = {IEEE},
}
The motion of a driving car is highly constrained and we claim that powerful predictors can be built that learn the typical egomotion statistics, and support the typical tasks of feature matching, tracking, and egomotion estimation. We analyze the statistics of the ground truth data given in the KITTI odometry benchmark sequences and confirm that a coordinated turn motion model, overlaid by moderate vibrations, is a very realistic model. We develop a predictor that is able to significantly reduce the uncertainty about the relative motion when a new image frame comes in. Such predictors can be used to steer the matching process from frame n to frame n + 1. We show that they can also be employed to detect outliers in the temporal sequence of egomotion parameters.
@inproceedings{diva2:955761,
author = {Bradler, Henry and Anne Wiegand, Birthe and Mester, Rudolf},
title = {{The Statistics of Driving Sequences - and what we can learn from them}},
booktitle = {2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW)},
year = {2015},
pages = {106--114},
publisher = {IEEE},
}
Tracking keypoints through a video sequence is a crucial first step in the processing chain of many visual SLAM approaches. This paper presents a robust initialization method to provide the initial match for a keypoint tracker, from the first frame where a keypoint is detected to the second frame, that is, when no depth information is available yet. We deal explicitly with the case of long displacements. The starting position is obtained through an optimization that employs a distribution of motion priors based on pyramidal phase correlation and epipolar geometry constraints. Experiments on the KITTI dataset demonstrate the significant impact of applying a motion prior to the matching. We provide detailed comparisons to the state-of-the-art methods.
@inproceedings{diva2:935987,
author = {Fanani, Nolang and Barnada, Marc and Mester, Rudolf},
title = {{Motion Priors Estimation for Robust Matching Initialization in Automotive Applications}},
booktitle = {Advances in Visual Computing},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9474},
pages = {115--126},
publisher = {SPRINGER INT PUBLISHING AG},
}
Visual object tracking is a challenging computer vision problem with numerous real-world applications. This paper investigates the impact of convolutional features for the visual tracking problem. We propose to use activations from the convolutional layer of a CNN in discriminative correlation filter based tracking frameworks. These activations have several advantages compared to the standard deep features (fully connected layers). Firstly, they mitigate the need of task-specific fine-tuning. Secondly, they contain structural information crucial for the tracking problem. Lastly, these activations have low dimensionality. We perform comprehensive experiments on three benchmark datasets: OTB, ALOV300++ and the recently introduced VOT2015. Surprisingly, and in contrast to image classification, our results suggest that activations from the first layer provide superior tracking performance compared to the deeper layers. Our results further show that the convolutional features provide improved results compared to standard handcrafted features. Finally, results comparable to state-of-the-art trackers are obtained on all three benchmark datasets.
@inproceedings{diva2:933006,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Convolutional Features for Correlation Filter Based Visual Tracking}},
booktitle = {2015 IEEE International Conference on Computer Vision Workshop (ICCVW)},
year = {2015},
pages = {621--629},
publisher = {IEEE conference proceedings},
}
During recent years, thermal cameras have decreased in both size and cost while improving image quality. The area of use for such cameras has expanded with many exciting applications, many of which require tracking of objects. While being subject to extensive research in the visual domain, tracking in thermal imagery has historically been of interest mainly for military purposes. The available thermal infrared datasets for evaluating methods addressing these problems are few and the ones that do are not challenging enough for today’s tracking algorithms. Therefore, we hereby propose a thermal infrared dataset for evaluation of short-term tracking methods. The dataset consists of 20 sequences which have been collected from multiple sources and the data format used is in accordance with the Visual Object Tracking (VOT) Challenge.
@inproceedings{diva2:925818,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{A thermal infrared dataset for evaluation of short-term tracking methods}},
booktitle = {Swedish Symposium on Image Analysis},
year = {2015},
series = {Svenska sällskapet för automatiserad bildanalys (SSBA)},
}
Reliable detection of obstacles at long range is crucial for the timely response to hazards by fast-moving safety-critical platforms like autonomous cars. We present a novel method for the joint detection and localization of distant obstacles using a stereo vision system on a moving platform. The approach is applicable to both static and moving obstacles and pushes the limits of detection performance as well as localization accuracy. The proposed detection algorithm is based on sound statistical tests using local geometric criteria which implicitly consider non-flat ground surfaces. To achieve maximum performance, it operates directly on image data instead of precomputed stereo disparity maps. A careful experimental evaluation on several datasets shows excellent detection performance and localization accuracy up to very large distances, even for small obstacles. We demonstrate a parallel implementation of the proposed system on a GPU that executes at real-time speeds.
@inproceedings{diva2:919357,
author = {Pinggera, Peter and Franke, Uwe and Mester, Rudolf},
title = {{High-Performance Long Range Obstacle Detection Using Stereo Vision}},
booktitle = {2015 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS)},
year = {2015},
series = {IEEE International Conference on Intelligent Robots and Systems},
pages = {1308--1313},
publisher = {IEEE},
}
The Thermal Infrared Visual Object Tracking challenge 2015, VOT-TIR2015, aims at comparing short-term single-object visual trackers that work on thermal infrared (TIR) sequences and do not apply pre-learned models of object appearance. VOT-TIR2015 is the first benchmark on short-term tracking in TIR sequences. Results of 24 trackers are presented. For each participating tracker, a short description is provided in the appendix. The VOT-TIR2015 challenge is based on the VOT2013 challenge, but introduces the following novelties: (i) the newly collected LTIR (Linköping TIR) dataset is used, (ii) the VOT2013 attributes are adapted to TIR data, (iii) the evaluation is performed using insights gained during VOT2013 and VOT2014 and is similar to VOT2015.
@inproceedings{diva2:917646,
author = {Felsberg, Michael and Berg, Amanda and Häger, Gustav and Ahlberg, Jörgen and Kristan, Matej and Matas, Jiri and Leonardis, Ales and Cehovin, Luka and Fernandez, Gustavo and Vojir, Tomas and Nebehay, Georg and Pflugfelder, Roman and Lukezic, Alan and Garcia-Martin, Alvaro and Saffari, Amir and Li, Ang and Solis Montero, Andres and Zhao, Baojun and Schmid, Cordelia and Chen, Dapeng and Du, Dawei and Shahbaz Khan, Fahad and Porikli, Fatih and Zhu, Gao and Zhu, Guibo and Lu, Hanqing and Kieritz, Hilke and Li, Hongdong and Qi, Honggang and Jeong, Jae-chan and Cho, Jae-il and Lee, Jae-Yeong and Zhu, Jianke and Li, Jiatong and Feng, Jiayi and Wang, Jinqiao and Kim, Ji-Wan and Lang, Jochen and Martinez, Jose M. and Xue, Kai and Alahari, Karteek and Ma, Liang and Ke, Lipeng and Wen, Longyin and Bertinetto, Luca and Danelljan, Martin and Arens, Michael and Tang, Ming and Chang, Ming-Ching and Miksik, Ondrej and Torr, Philip H S and Martin-Nieto, Rafael and Laganiere, Robert and Hare, Sam and Lyu, Siwei and Zhu, Song-Chun and Becker, Stefan and Hicks, Stephen L and Golodetz, Stuart and Choi, Sunglok and Wu, Tianfu and Hubner, Wolfgang and Zhao, Xu and Hua, Yang and Li, Yang and Lu, Yang and Li, Yuezun and Yuan, Zejian and Hong, Zhibin},
title = {{The Thermal Infrared Visual Object Tracking VOT-TIR2015 Challenge Results}},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
year = {2015},
series = {IEEE International Conference on Computer Vision. Proceedings},
pages = {639--651},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
In this work we derive a novel framework rendering measured distributions into approximated distributions of their mean. This is achieved by exploiting constraints imposed by the Gauss-Markov theorem from estimation theory, being valid for mono-modal Gaussian distributions. It formulates the relation between the variance of measured samples and the so-called standard error, being the standard deviation of their mean. However, multi-modal distributions are present in numerous image processing scenarios, e.g. local gray value or color distributions at object edges, or orientation or displacement distributions at occlusion boundaries in motion estimation or stereo. Our method not only aims at estimating the modes of these distributions together with their standard error, but at describing the whole multi-modal distribution. We utilize the method of channel representation, a kind of soft histogram also known as population codes, to represent distributions in a non-parametric, generic fashion. Here we apply the proposed scheme to general mono- and multimodal Gaussian distributions to illustrate its effectiveness and compliance with the Gauss-Markov theorem.
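For concreteness, a minimal sketch of the channel representation (soft histogram) referred to above, using the common cos^2 basis; the number of channels and the value range are illustrative, and this is the plain encoding, not the proposed sharpening scheme:

import numpy as np

def channel_encode(samples, n_channels=11, lo=0.0, hi=1.0):
    # Overlapping cos^2 channels: each sample activates (at most) three neighbouring channels.
    centers = np.linspace(lo, hi, n_channels)
    spacing = centers[1] - centers[0]
    dist = np.abs(np.asarray(samples, dtype=float)[:, None] - centers[None, :]) / spacing
    weights = np.where(dist < 1.5, np.cos(np.pi * dist / 3.0) ** 2, 0.0)
    # The summed coefficients describe the whole (possibly multi-modal) distribution.
    return weights.sum(axis=0)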
@inproceedings{diva2:904084,
author = {Åström, Freddie and Felsberg, Michael and Scharr, Hanno},
title = {{Adaptive sharpening of multimodal distributions}},
booktitle = {Colour and Visual Computing Symposium (CVCS), 2015},
year = {2015},
publisher = {IEEE},
}
Visual odometry is one of the most active topics in computer vision. The automotive industry is particularly interested in this field due to the appeal of achieving a high degree of accuracy with inexpensive sensors such as cameras. The best results on this task are currently achieved by systems based on a calibrated stereo camera rig, whereas monocular systems are generally lagging behind in terms of performance. We hypothesise that this is due to stereo visual odometry being an inherently easier problem, rather than due to a higher quality of the state-of-the-art stereo-based algorithms. Under this hypothesis, techniques developed for monocular visual odometry systems would be, in general, more refined and robust since they have to deal with an intrinsically more difficult problem. In this work we present a novel stereo visual odometry system for automotive applications based on advanced monocular techniques. We show that the generalization of these techniques to the stereo case results in a significant improvement of the robustness and accuracy of stereo-based visual odometry. We support our claims with the system's results on the well-known KITTI benchmark, where it achieves the top rank among vision-only systems.
@inproceedings{diva2:859674,
author = {Persson, Mikael and Piccini, Tommaso and Felsberg, Michael and Mester, Rudolf},
title = {{Robust Stereo Visual Odometry from Monocular Techniques}},
booktitle = {2015 IEEE Intelligent Vehicles Symposium (IV)},
year = {2015},
series = {Intelligent Vehicle, IEEE Symposium},
pages = {686--691},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model.
We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
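In our notation, a hedged sketch of the spatially regularized learning problem described above (a paraphrase, not necessarily the paper's exact formulation): the usual DCF ridge penalty is replaced by a spatial weight w that grows towards the borders of the training patch,

\varepsilon(f) = \sum_{k=1}^{t} \alpha_k \Big\| \sum_{l=1}^{d} x_k^{l} * f^{l} - y_k \Big\|^2 + \sum_{l=1}^{d} \big\| w \cdot f^{l} \big\|^2

where x_k are training samples with weights \alpha_k, y_k are the desired (Gaussian) confidence outputs and the element-wise penalty w suppresses filter coefficients outside the target region. This is what allows the filter to be trained on much larger image regions, and thus more negative data, without the periodic boundary effects corrupting the model.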
@inproceedings{diva2:857265,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Learning Spatially Regularized Correlation Filters for Visual Tracking}},
booktitle = {Proceedings of the International Conference in Computer Vision (ICCV), 2015},
year = {2015},
series = {IEEE International Conference on Computer Vision. Proceedings},
pages = {4310--4318},
publisher = {IEEE Computer Society},
}
Describing persons and their actions is a challenging problem due to variations in pose, scale and viewpoint in real-world images. Recently, the semantic pyramids approach [1] for pose normalization has been shown to provide excellent results for gender and action recognition. The performance of the semantic pyramids approach relies on robust image description and is therefore limited by the use of shallow local features. In the context of object recognition [2] and object detection [3], convolutional neural networks (CNNs), or deep features, have been shown to improve the performance over conventional shallow features.
We propose deep semantic pyramids for human attributes and action recognition. The method works by constructing spatial pyramids based on CNNs of different part locations. These pyramids are then combined to obtain a single semantic representation. We validate our approach on the Berkeley and 27 Human Attributes datasets for attribute classification. For action recognition, we perform experiments on two challenging datasets: Willow and PASCAL VOC 2010. The proposed deep semantic pyramids provide a significant gain of 17.2%, 13.9%, 24.3% and 22.6% compared to the standard shallow semantic pyramids on the Berkeley, 27 Human Attributes, Willow and PASCAL VOC 2010 datasets, respectively. Our results also show that deep semantic pyramids outperform conventional CNNs based on the full bounding box of the person. Finally, we compare our approach with state-of-the-art methods and show a gain in performance compared to the best methods in the literature.
@inproceedings{diva2:857230,
author = {Khan, Fahad Shahbaz and Rao, Muhammad Anwer and van de Weijer, Joost and Felsberg, Michael and Laaksonen, Jorma},
title = {{Deep Semantic Pyramids for Human Attributes and Action Recognition}},
booktitle = {Image Analysis},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9127},
pages = {341--353},
publisher = {Springer},
}
@inproceedings{diva2:856870,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Online learning of autonomous driving using channel representations of multi-modal joint distributions}},
booktitle = {Proceedings of SSBA, Swedish Symposium on Image Analysis, 2015},
year = {2015},
publisher = {Swedish Society for automated image analysis},
}
Panorama stitching of sparsely structured scenes is an open research problem. In this setting, feature-based image alignment methods often fail due to shortage of distinct image features. Instead, direct image alignment methods, such as those based on phase correlation, can be applied. In this paper we investigate correlation-based image alignment techniques for panorama stitching of sparsely structured scenes. We propose a novel image alignment approach based on discriminative correlation filters (DCF), which has recently been successfully applied to visual tracking. Two versions of the proposed DCF-based approach are evaluated on two real and one synthetic panorama dataset of sparsely structured indoor environments. All three datasets consist of images taken on a tripod rotating 360 degrees around the vertical axis through the optical center. We show that the proposed DCF-based methods outperform phase correlation-based approaches on these datasets.
@inproceedings{diva2:856868,
author = {Meneghetti, Giulia and Danelljan, Martin and Felsberg, Michael and Nordberg, Klas},
title = {{Image alignment for panorama stitching in sparsely structured environments}},
booktitle = {Image Analysis. SCIA 2015.},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9127},
pages = {428--439},
publisher = {Springer},
}
An open issue in multiple view geometry and structure from motion, applied to real life scenarios, is the sparsity of the matched key-points and of the reconstructed point cloud. We present an approach that can significantly improve the density of measured displacement vectors in a sparse matching or tracking setting, exploiting the partial information of the motion field provided by linear oriented image patches (edgels). Our approach assumes that the epipolar geometry of an image pair already has been computed, either in an earlier feature-based matching step, or by a robustified differential tracker. We exploit key-points of a lower order, edgels, which cannot provide a unique 2D matching, but can be employed if a constraint on the motion is already given. We present a method to extract edgels, which can be effectively tracked given a known camera motion scenario, and show how a constrained version of the Lucas-Kanade tracking procedure can efficiently exploit epipolar geometry to reduce the classical KLT optimization to a 1D search problem. The potential of the proposed methods is shown by experiments performed on real driving sequences.
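To make the dimensionality reduction concrete, the sketch below scans template positions along the epipolar line by brute force; the paper instead solves a constrained Lucas-Kanade problem, so this is only a stand-in with illustrative names and parameters (x is the keypoint as (column, row), F the fundamental matrix):

import numpy as np

def epipolar_line(F, x):
    # Epipolar line l = F x in the second image, scaled so that (l0, l1) is a unit normal.
    l = F @ np.array([x[0], x[1], 1.0])
    return l / np.linalg.norm(l[:2])

def match_along_line(img2, template, F, x, search_px=30):
    l = epipolar_line(F, x)
    p0 = np.array(x, dtype=float)
    p0 -= (l[0] * p0[0] + l[1] * p0[1] + l[2]) * l[:2]   # project the start point onto the line
    direction = np.array([-l[1], l[0]])                  # unit direction along the line
    h, w = template.shape
    best_pos, best_err = None, np.inf
    for t in range(-search_px, search_px + 1):           # the search is now one-dimensional
        cx, cy = np.round(p0 + t * direction).astype(int)
        if cy < h or cx < w or cy + h >= img2.shape[0] or cx + w >= img2.shape[1]:
            continue
        cand = img2[cy - h // 2:cy + h // 2 + 1, cx - w // 2:cx + w // 2 + 1]
        if cand.shape != template.shape:
            continue
        err = np.sum((cand.astype(float) - template.astype(float)) ** 2)
        if err < best_err:
            best_pos, best_err = (cx, cy), err
    return best_pos, best_err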
@inproceedings{diva2:856772,
author = {Piccini, Tommaso and Persson, Mikael and Nordberg, Klas and Felsberg, Michael and Mester, Rudolf},
title = {{Good Edgels to Track: Beating the Aperture Problem with Epipolar Geometry}},
booktitle = {COMPUTER VISION - ECCV 2014 WORKSHOPS, PT II},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {8926},
pages = {652--664},
publisher = {Elsevier},
}
In recent years, short-term single-object tracking has emerged as a popular research topic, as it constitutes the core of more general tracking systems. Many such tracking methods are based on matching a part of the image with a template that is learnt online and represented by, for example, a correlation filter or a distribution field. In order for such a tracker to be able to not only find the position, but also the scale, of the tracked object in the next frame, some kind of scale estimation step is needed. This step is sometimes separate from the position estimation step, but is nevertheless jointly evaluated in de facto benchmarks. However, for practical as well as scientific reasons, the scale estimation step should be evaluated separately – for example, there might in certain situations be other methods more suitable for the task. In this paper, we describe an evaluation method for scale estimation in template-based short-term single-object tracking, and evaluate two state-of-the-art tracking methods where estimation of scale and position are separable.
@inproceedings{diva2:853786,
author = {Ahlberg, Jörgen and Berg, Amanda},
title = {{Evaluating Template Rescaling in Short-Term Single-Object Tracking}},
booktitle = {17th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), Karlsruhe, Germany, August 25, 2015},
year = {2015},
publisher = {IEEE},
}
The Visual Object Tracking challenge 2014, VOT2014, aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 38 trackers are presented. The number of tested trackers makes VOT 2014 the largest benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the appendix. Features of the VOT2014 challenge that go beyond its VOT2013 predecessor are introduced: (i) a new VOT2014 dataset with full annotation of targets by rotated bounding boxes and per-frame attribute, (ii) extensions of the VOT2013 evaluation methodology, (iii) a new unit for tracking speed assessment less dependent on the hardware and (iv) the VOT2014 evaluation toolkit that significantly speeds up execution of experiments. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
@inproceedings{diva2:850764,
author = {Kristan, Matej and Pflugfelder, Roman P. and Leonardis, Ales and Matas, Jiri and Cehovin, Luka and Nebehay, Georg and Vojir, Tomas and Fernandez, Gustavo and Lukezi, Alan and Dimitriev, Aleksandar and Petrosino, Alfredo and Saffari, Amir and Li, Bo and Han, Bohyung and Heng, CherKeng and Garcia, Christophe and Pangersic, Dominik and Häger, Gustav and Khan, Fahad Shahbaz and Oven, Franci and Possegger, Horst and Bischof, Horst and Nam, Hyeonseob and Zhu, Jianke and Li, JiJia and Choi, Jin Young and Choi, Jin-Woo and Henriques, Joao F. and van de Weijer, Joost and Batista, Jorge and Lebeda, Karel and Ofjall, Kristoffer and Yi, Kwang Moo and Qin, Lei and Wen, Longyin and Maresca, Mario Edoardo and Danelljan, Martin and Felsberg, Michael and Cheng, Ming-Ming and Torr, Philip and Huang, Qingming and Bowden, Richard and Hare, Sam and YueYing Lim, Samantha and Hong, Seunghoon and Liao, Shengcai and Hadfield, Simon and Li, Stan Z. and Duffner, Stefan and Golodetz, Stuart and Mauthner, Thomas and Vineet, Vibhav and Lin, Weiyao and Li, Yang and Qi, Yuankai and Lei, Zhen and Niu, ZhiHeng},
title = {{The Visual Object Tracking VOT2014 Challenge Results}},
booktitle = {COMPUTER VISION - ECCV 2014 WORKSHOPS, PT II},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {8926},
pages = {191--217},
publisher = {Springer},
}
Visual object tracking is a classical, but still open research problem in computer vision, with many real-world applications. The problem is challenging due to several factors, such as illumination variation, occlusions, camera motion and appearance changes. Such problems can be alleviated by constructing robust, discriminative and computationally efficient visual features. Recently, biologically-inspired channel representations [Felsberg et al., PAMI 2006] have been shown to provide promising results in many applications ranging from autonomous driving to visual tracking.
This paper investigates the problem of coloring channel representations for visual tracking. We evaluate two strategies, channel concatenation and channel product, to construct channel coded color representations. The proposed channel coded color representations are generic and can be used beyond tracking.
Experiments are performed on 41 challenging benchmark videos. Our experiments clearly suggest that a careful selection of color features, together with an optimal fusion strategy, significantly outperforms the standard luminance-based channel representation. Finally, we show promising results compared to state-of-the-art tracking methods in the literature.
@inproceedings{diva2:850742,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad Shahbaz and Felsberg, Michael},
title = {{Coloring Channel Representations for Visual Tracking}},
booktitle = {19th Scandinavian Conference, SCIA 2015, Copenhagen, Denmark, June 15-17, 2015. Proceedings},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9127},
pages = {117--129},
publisher = {Springer},
}
Short-term single-object (STSO) tracking in thermal images is a challenging problem relevant in a growing number of applications. In order to evaluate STSO tracking algorithms on visual imagery, there are de facto standard benchmarks. However, we argue that tracking in thermal imagery is different than in visual imagery, and that a separate benchmark is needed. The available thermal infrared datasets are few and the existing ones are not challenging for modern tracking algorithms. Therefore, we hereby propose a thermal infrared benchmark according to the Visual Object Tracking (VOT) protocol for evaluation of STSO tracking methods. The benchmark includes the new LTIR dataset containing 20 thermal image sequences which have been collected from multiple sources and annotated in the format used in the VOT Challenge. In addition, we show that the ranking of different tracking principles differ between the visual and thermal benchmarks, confirming the need for the new benchmark.
@inproceedings{diva2:850688,
author = {Berg, Amanda and Ahlberg, Jörgen and Felsberg, Michael},
title = {{A Thermal Object Tracking Benchmark}},
booktitle = {12th IEEE International Conference on Advanced Video- and Signal-based Surveillance, Karlsruhe, Germany, August 25-28 2015},
year = {2015},
publisher = {IEEE},
}
We propose a technique for joint calibration of a wide-angle rolling shutter camera (e.g. a GoPro) and an externally mounted gyroscope. The calibrated parameters are time scaling and offset, relative pose between gyroscope and camera, and gyroscope bias. The parameters are found using non-linear least squares minimisation with the symmetric transfer error as cost function. The primary contribution is methods for robust initialisation of the relative pose and time offset, which are essential for convergence. We also introduce a robust error norm to handle outliers. This results in a technique that works with general video content and does not require any specific setup or calibration patterns. We apply our method to stabilisation of videos recorded by a rolling shutter camera with a rigidly attached gyroscope. After recording, the gyroscope and camera are jointly calibrated using the recorded video itself. The recorded video can then be stabilised using the calibrated parameters. We evaluate the technique on video sequences with varying difficulty and motion frequency content. The experiments demonstrate that our method can be used to produce high quality stabilised videos even under difficult conditions, and that the proposed initialisation ends up within the basin of attraction. We also show that a residual based on the symmetric transfer error is more accurate than residuals based on the recently proposed epipolar plane normal coplanarity constraint.
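To illustrate why a dedicated initialisation is needed, a common coarse initialiser for the time offset is to cross-correlate the gyroscope rotation-rate magnitude with a rotation-rate magnitude derived from the video; the sketch below assumes both signals have already been resampled to a common rate and is not necessarily the procedure used in the paper:

import numpy as np

def init_time_offset(gyro_rate_mag, video_rate_mag, sample_dt):
    # Coarse time offset (in seconds) that maximises the cross-correlation of the two signals.
    a = gyro_rate_mag - np.mean(gyro_rate_mag)
    b = video_rate_mag - np.mean(video_rate_mag)
    corr = np.correlate(a, b, mode='full')
    lag = np.argmax(corr) - (len(b) - 1)   # lag in samples; sign convention depends on which signal leads
    return lag * sample_dt

An offset found this way is only good to roughly one sample, which is exactly why it is suited as an initial value for the subsequent non-linear least squares refinement.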
@inproceedings{diva2:841497,
author = {Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik},
title = {{Gyroscope-based video stabilisation with auto-calibration}},
booktitle = {2015 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA)},
year = {2015},
series = {IEEE International Conference on Robotics and Automation ICRA},
pages = {2090--2097},
}
We propose a method for detecting obstacles on the railway in front of a moving train using a monocular thermal camera. The problem is motivated by the large number of collisions between trains and various obstacles, resulting in reduced safety and high costs. The proposed method includes a novel way of detecting the rails in the imagery, as well as a way to detect anomalies on the railway. While the problem at a first glance looks similar to road and lane detection, which in the past has been a popular research topic, a closer look reveals that the problem at hand is previously unaddressed. As a consequence, relevant datasets are missing as well, and thus our contribution is two-fold: We propose an approach to the novel problem of obstacle detection on railways and we describe the acquisition of a novel data set.
@inproceedings{diva2:824491,
author = {Berg, Amanda and Öfjäll, Kristoffer and Ahlberg, Jörgen and Felsberg, Michael},
title = {{Detecting Rails and Obstacles Using a Train-Mounted Thermal Camera}},
booktitle = {Image Analysis},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9127},
pages = {492--503},
publisher = {Springer},
}
In this article we provide an overview of color name applications in computer vision. Color names are linguistic labels which humans use to communicate color. Computational color naming learns a mapping from pixel values to color names. In recent years color names have been applied to a wide variety of computer vision applications, including image classification, object recognition, texture classification, visual tracking and action recognition. Here we provide an overview of these results, which show that in general color names outperform photometric invariants as a color representation.
@inproceedings{diva2:818099,
author = {van de Weijer, Joost and Khan, Fahad},
title = {{An Overview of Color Name Applications in Computer Vision}},
booktitle = {COMPUTATIONAL COLOR IMAGING, CCIW 2015},
year = {2015},
pages = {16--22},
publisher = {Springer Verlag (Germany)},
}
Micro unmanned aerial vehicles are becoming increasingly interesting for aiding and collaborating with human agents in a myriad of applications, but in particular they are useful for monitoring inaccessible or dangerous areas. In order to interact with and monitor humans, these systems need robust and real-time computer vision subsystems that can detect and follow persons.
In this work, we propose a low-level active vision framework to accomplish these challenging tasks. Based on the LinkQuad platform, we present a system study that implements the detection and tracking of people under fully autonomous flight conditions, keeping the vehicle within a certain distance of a person. The framework integrates state-of-the-art methods from visual detection and tracking, Bayesian filtering, and AI-based control. The results from our experiments clearly suggest that the proposed framework performs real-time detection and tracking of persons in complex scenarios.
@inproceedings{diva2:796839,
author = {Danelljan, Martin and Khan, Fahad Shahbaz and Felsberg, Michael and Granström, Karl and Heintz, Fredrik and Rudol, Piotr and Wzorek, Mariusz and Kvarnström, Jonas and Doherty, Patrick},
title = {{A Low-Level Active Vision Framework for Collaborative Unmanned Aircraft Systems}},
booktitle = {COMPUTER VISION - ECCV 2014 WORKSHOPS, PT I},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {8925},
pages = {223--237},
publisher = {Springer Publishing Company},
}
We present a novel approach for segmenting different motions from 3D trajectories. Our approach uses the theory of transformation groups to derive a set of invariants of 3D points located on the same rigid object. These invariants are inexpensive to calculate, involving primarily QR factorizations of small matrices. The invariants are easily converted into a set of robust motion affinities and with the use of a local sampling scheme and spectral clustering, they can be incorporated into a highly efficient motion segmentation algorithm. We have also captured a new multi-object 3D motion dataset, on which we have evaluated our approach, and compared against state-of-the-art competing methods from literature. Our results show that our approach outperforms all methods while being robust to perspective distortions and degenerate configurations.
@inproceedings{diva2:789181,
author = {Zografos, Vasileios and Lenz, Reiner and Ringaby, Erik and Felsberg, Michael and Nordberg, Klas},
title = {{Fast segmentation of sparse 3D point trajectories using group theoretical invariants}},
booktitle = {COMPUTER VISION - ACCV 2014, PT IV},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9006},
pages = {675--691},
publisher = {Springer},
}
There are three major issues for visual object trackers: model representation, search, and model update. In this paper we address the last two issues for a specific model representation, grid-based distribution models by means of channel-based distribution fields. In particular, we address the comparison part of searching. Previous work in the area has used standard methods for comparison and update, not exploiting all the possibilities of the representation. In this work we propose two comparison schemes and one update scheme adapted to the distribution model. The proposed schemes significantly improve the accuracy and robustness on the Visual Object Tracking (VOT) 2014 Challenge dataset.
@inproceedings{diva2:787563,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Weighted Update and Comparison for Channel-Based Distribution Field Tracking}},
booktitle = {COMPUTER VISION - ECCV 2014 WORKSHOPS, PT II},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {8926},
pages = {218--231},
publisher = {Springer},
}
We present a novel variational approach to a tensor-based total variation formulation which is called gradient energy total variation, GETV. We introduce the gradient energy tensor into the GETV and show that the corresponding Euler-Lagrange (E-L) equation is a tensor-based partial differential equation of total variation type. Furthermore, we give a proof which shows that GETV is a convex functional. This approach, in contrast to the commonly used structure tensor, enables a formal derivation of the corresponding E-L equation. Experimental results suggest that GETV compares favourably to other state of the art variational denoising methods such as extended anisotropic diffusion (EAD) and total variation (TV) for gray-scale and colour images.
@inproceedings{diva2:764721,
author = {Åström, Freddie and Baravdish, George and Felsberg, Michael},
title = {{A Tensor Variational Formulation of Gradient Energy Total Variation}},
booktitle = {ENERGY MINIMIZATION METHODS IN COMPUTER VISION AND PATTERN RECOGNITION, EMMCVPR 2015},
year = {2015},
series = {Lecture Notes in Computer Science},
pages = {307--320},
publisher = {Springer Berlin/Heidelberg},
}
Many image processing methods such as corner detection, optical flow and iterative enhancement make use of image tensors. Generally, these tensors are estimated using the structure tensor. In this work we show that the gradient energy tensor can be used as an alternative to the structure tensor in several cases. We apply the gradient energy tensor to common image processing applications such as corner detection, optical flow and image enhancement. Our experimental results suggest that the gradient energy tensor enables real-time tensor-based image enhancement using the graphical processing unit (GPU), and we obtain a 40% increase in frame rate without loss of image quality.
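For contrast, a standard structure-tensor baseline (the default estimator the abstract compares against) together with a Harris-style corner response built from it; a minimal sketch with illustrative smoothing parameters, not the gradient energy tensor itself:

import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(img, grad_sigma=1.0, window_sigma=2.0):
    # Outer product of Gaussian-derivative gradients, averaged over a local window.
    ix = gaussian_filter(img.astype(float), grad_sigma, order=(0, 1))
    iy = gaussian_filter(img.astype(float), grad_sigma, order=(1, 0))
    jxx = gaussian_filter(ix * ix, window_sigma)
    jxy = gaussian_filter(ix * iy, window_sigma)
    jyy = gaussian_filter(iy * iy, window_sigma)
    return jxx, jxy, jyy

def corner_response(img, k=0.04):
    # Harris-style response det(J) - k * trace(J)^2 from the tensor components.
    jxx, jxy, jyy = structure_tensor(img)
    return jxx * jyy - jxy ** 2 - k * (jxx + jyy) ** 2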
@inproceedings{diva2:758278,
author = {Åström, Freddie and Felsberg, Michael},
title = {{On the Choice of Tensor Estimation for Corner Detection, Optical Flow and Denoising}},
booktitle = {COMPUTER VISION - ACCV 2014 WORKSHOPS, PT II},
year = {2015},
series = {Lecture Notes in Computer Science},
volume = {9009},
pages = {16--30},
publisher = {Springer},
}
In this paper we address the problem of automatically detecting leakages in underground pipes of district heating networks from images captured by an airborne thermal camera. The basic idea is to classify each relevant image region as a leakage if its temperature exceeds a threshold. This simple approach yields a significant number of false positives. We propose to address this issue by machine learning techniques and provide extensive experimental analysis on real-world data. The results show that this postprocessing step significantly improves the usefulness of the system.
@inproceedings{diva2:925813,
author = {Berg, Amanda and Ahlberg, Jörgen},
title = {{Classifying district heating network leakages in aerial thermal imagery}},
booktitle = {Swedish Symposium on Image Analysis},
year = {2014},
series = {Svenska sällskapet för automatiserad bildanalys (SSBA)},
}
We discuss matching measures (scores and residuals) for comparing image patches under unknown affine photometric (=intensity) transformations. In contrast to existing methods, we derive a fully symmetric matching measure which reflects the fact that both copies of the signal are affected by measurement errors (noise), not only one. As it turns out, this evolves into an eigensystem problem; however a simple direct solution for all entities of interest can be given. We strongly advocate for constraining the estimated gain ratio and the estimated mean value offset to realistic ranges, thus preventing the matching scheme from locking into unrealistic correspondences.
@inproceedings{diva2:853439,
author = {Mester, Rudolf and Conrad, Christian},
title = {{When patches match - a statistical view on matching under illumination variation}},
booktitle = {2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)},
year = {2014},
series = {International Conference on Pattern Recognition},
pages = {4364--4369},
publisher = {IEEE COMPUTER SOC},
}
The present paper analyzes some previously unexplored aspects of motion estimation that are fundamental both for discrete block matching and for differential optical flow approaches à la Lucas-Kanade. It aims at providing a complete estimation-theoretic approach that makes the assumptions about noisy observations of samples from a continuous signal of a certain class explicit. It turns out that motion estimation is a combination of simultaneously estimating the true underlying continuous signal and optimizing the displacement between two hypothetical copies of this unknown signal. Practical schemes such as the current variants of Lucas-Kanade are just approximations to the fundamental estimation problem identified in the present paper. Derivatives appear as derivatives of the continuous signal representation kernels, not as ad hoc discrete derivative masks. The formulation via an explicit signal space defined by kernels is a precondition for analyzing, e.g., the convergence range of iterative displacement estimation procedures, and for systematically choosing preconditioning filters. The paper sets the stage for further in-depth analysis of some fundamental issues that have so far been overlooked or ignored in motion analysis.
@inproceedings{diva2:824773,
author = {Mester, Rudolf},
title = {{Motion Estimation Revisited: an Estimation-Theoretic Approach}},
booktitle = {2014 IEEE SOUTHWEST SYMPOSIUM ON IMAGE ANALYSIS AND INTERPRETATION (SSIAI 2014)},
year = {2014},
pages = {113--116},
publisher = {IEEE},
}
Recognizing human actions in still images is a challenging problem in computer vision due to significant amount of scale, illumination and pose variation. Given the bounding box of a person both at training and test time, the task is to classify the action associated with each bounding box in an image. Most state-of-the-art methods use the bag-of-words paradigm for action recognition. The bag-of-words framework employing a dense multi-scale grid sampling strategy is the de facto standard for feature detection. This results in a scale invariant image representation where all the features at multiple-scales are binned in a single histogram. We argue that such a scale invariant strategy is sub-optimal since it ignores the multi-scale information available with each bounding box of a person. This paper investigates alternative approaches to scale coding for action recognition in still images. We encode multi-scale information explicitly in three different histograms for small, medium and large scale visual-words. Our first approach exploits multi-scale information with respect to the image size. In our second approach, we encode multi-scale information relative to the size of the bounding box of a person instance. In each approach, the multi-scale histograms are then concatenated into a single representation for action classification. We validate our approaches on the Willow dataset which contains seven action categories: interacting with computer, photography, playing music, riding bike, riding horse, running and walking. Our results clearly suggest that the proposed scale coding approaches outperform the conventional scale invariant technique. Moreover, we show that our approach obtains promising results compared to more complex state-of-the-art methods.
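A minimal sketch of the relative scale-coding idea: instead of one scale-invariant histogram, visual words are binned into separate small/medium/large histograms that are concatenated; the band thresholds and the per-band normalisation are illustrative choices, not the paper's exact settings:

import numpy as np

def scale_coded_bow(word_ids, feature_scales, box_height, vocab_size,
                    small_frac=0.05, large_frac=0.15):
    # Scale bands defined relative to the person bounding box (the second approach above).
    hists = np.zeros((3, vocab_size))
    for w, s in zip(word_ids, feature_scales):
        rel = s / float(box_height)
        band = 0 if rel < small_frac else (2 if rel > large_frac else 1)
        hists[band, w] += 1.0
    hists /= np.maximum(hists.sum(axis=1, keepdims=True), 1.0)   # per-band L1 normalisation
    return hists.ravel()   # concatenated small/medium/large representation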
@inproceedings{diva2:801569,
author = {Khan, Fahad and Van, De Weijer J. and Bagdanov, A.D. and Felsberg, Michael},
title = {{Scale coding bag-of-words for action recognition}},
booktitle = {Pattern Recognition (ICPR), 2014 22nd International Conference on},
year = {2014},
series = {International Conference on Pattern Recognition},
pages = {1514--1519},
publisher = {Institute of Electrical and Electronics Engineers Inc.},
}
We describe a system for active stabilization of cameras mounted on highly dynamic robots. To focus on careful performance evaluation of the stabilization algorithm, we use a camera mounted on a robotic test platform that can have unknown perturbations in the horizontal plane, a commonly occurring scenario in mobile robotics. We show that the camera can be effectively stabilized using an inertial sensor and a single additional motor, without a joint position sensor. The algorithm uses an adaptive controller based on a model of the vertebrate Cerebellum for velocity stabilization, with additional drift correction. We have also developed a resolution adaptive retinal slip algorithm that is robust to motion blur.
We evaluated the performance quantitatively using another high speed robot to generate repeatable sequences of large and fast movements that a gaze stabilization system can attempt to counteract. Thanks to the high-accuracy repeatability, we can make a fair comparison of algorithms for gaze stabilization. We show that the resulting system can reduce camera image motion to about one pixel per frame on average even when the platform is rotated at 200 degrees per second. As a practical application, we also demonstrate how the common task of face detection benefits from active gaze stabilization.
@inproceedings{diva2:789200,
author = {Lesmana, Martin and Landgren, Axel and Forss\'{e}n, Per-Erik and Pai, Dinesh K.},
title = {{Active Gaze Stabilization}},
booktitle = {Proceedings of the 2014 Indian Conference on Computer Vision Graphics and Image Processing},
year = {2014},
pages = {81:1--81:8},
publisher = {ACM Digital Library},
}
@inproceedings{diva2:787561,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Online Learning and Mode Switching for Autonomous Driving from Demonstration}},
booktitle = {Proceedings of SSBA, Swedish Symposium on Image Analysis, 2014},
year = {2014},
}
Robust scale estimation is a challenging problem in visual object tracking. Most existing methods fail to handle large scale variations in complex image sequences. This paper presents a novel approach for robust scale estimation in a tracking-by-detection framework. The proposed approach works by learning discriminative correlation filters based on a scale pyramid representation. We learn separate filters for translation and scale estimation, and show that this improves the performance compared to an exhaustive scale search. Our scale estimation approach is generic as it can be incorporated into any tracking method with no inherent scale estimation.
Experiments are performed on 28 benchmark sequences with significant scale variations. Our results show that the proposed approach significantly improves the performance by 18.8 % in median distance precision compared to our baseline. Finally, we provide both quantitative and qualitative comparisons of our approach with state-of-the-art trackers in the literature. The proposed method is shown to outperform the best existing tracker by 16.6 % in median distance precision, while operating in real time.
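The correlation filters themselves have a simple closed form in the Fourier domain; below is a hedged single-channel MOSSE-style sketch of that closed form. The paper learns separate multi-channel filters for translation and for a one-dimensional scale pyramid, which this simplified version does not reproduce:

import numpy as np

def learn_dcf(train_patch, desired_response, lam=1e-2):
    # Closed-form single-channel DCF: Hbar = (G * conj(F)) / (F * conj(F) + lambda).
    F = np.fft.fft2(train_patch.astype(float))
    G = np.fft.fft2(desired_response)              # e.g. a narrow Gaussian peak
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def apply_dcf(Hbar, test_patch):
    # Correlation response; its argmax gives the translation (or the scale index for a 1D scale filter).
    Z = np.fft.fft2(test_patch.astype(float))
    response = np.real(np.fft.ifft2(Z * Hbar))
    return np.unravel_index(np.argmax(response), response.shape)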
@inproceedings{diva2:785778,
author = {Danelljan, Martin and Häger, Gustav and Khan, Fahad and Felsberg, Michael},
title = {{Accurate Scale Estimation for Robust Visual Tracking}},
booktitle = {Proceedings of the British Machine Vision Conference 2014},
year = {2014},
publisher = {BMVA Press},
}
District heating pipes are known to degenerate with time and in some cities the pipes have been used for several decades. Due to bad insulation or cracks, energy or media leakages might appear. This paper presents a complete system for large-scale monitoring of district heating networks, including methods for detection, classification and temporal characterization of (potential) leakages. The system analyses thermal infrared images acquired by an aircraft-mounted camera, detecting the areas for which the pixel intensity is higher than normal. Unfortunately, the system also finds many false detections, i.e., warm areas that are not caused by media or energy leakages. Thus, in order to reduce the number of false detections we describe a machine learning method to classify the detections. The results, based on data from three district heating networks show that we can remove more than half of the false detections. Moreover, we also propose a method to characterize leakages over time, that is, repeating the image acquisition one or a few years later and indicate areas that suffer from an increased energy loss.
@inproceedings{diva2:776415,
author = {Berg, Amanda and Ahlberg, Jörgen},
title = {{Classification and temporal analysis of district heating leakages in thermal images}},
booktitle = {Proceedings of The 14th International Symposium on District Heating and Cooling},
year = {2014},
}
We address the problem of reducing the number of false alarms among automatically detected leakages in district heating networks. The leakages are detected in images captured by an airborne thermal camera, and each detection corresponds to an image region with abnormally high temperature. This approach yields a significant number of false positives, and we propose to reduce this number in two steps. First, we use a building segmentation scheme in order to remove detections on buildings. Second, we extract features from the detections and use a Random forest classifier on the remaining detections. We provide extensive experimental analysis on real-world data, showing that this post-processing step significantly improves the usefulness of the system.
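The second step above, a feature-based Random Forest classifier that prunes the remaining false alarms, maps directly onto standard tooling. The sketch below uses scikit-learn; the feature names and the 0.5 decision threshold are assumptions for illustration, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prune_false_alarms(X_train, y_train, X_new, threshold=0.5):
    """Keep only detections that a Random Forest believes are true leakages.

    X_train : per-detection features (e.g. region area, mean/max temperature, contrast)
    y_train : labels, 1 = true leakage, 0 = false alarm
    X_new   : features of new, unlabelled detections
    """
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    keep = clf.predict_proba(X_new)[:, 1] > threshold
    return np.flatnonzero(keep)   # indices of detections to keep
```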
@inproceedings{diva2:776248,
author = {Berg, Amanda and Ahlberg, Jörgen},
title = {{Classification of leakage detections acquired by airborne thermography of district heating networks}},
booktitle = {2014 8th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS)},
year = {2014},
series = {IAPR Workshop on Pattern Recognition in Remote Sensing},
pages = {1--4},
publisher = {IEEE},
}
Estimating the position of a 3-dimensional world point given its 2-dimensional projections in a set of images is a key component in numerous computer vision systems. There are several methods dealing with this problem, ranging from sub-optimal linear least-squares triangulation in two views, to finding the world point that minimizes the L2 reprojection error in three views. The latter yields the statistically optimal estimate under the assumption of Gaussian noise. In this paper we present a solution to optimal triangulation in three views. The standard approach to the three-view triangulation problem is to find a closed-form solution. In contrast, we propose a new method based on an iterative scheme. The method is rigorously tested on both synthetic and real image data with corresponding ground truth, on a midrange desktop PC and a Raspberry Pi, a low-end mobile platform. We are able to improve the precision achieved by the closed-form solvers and reach a speed-up of two orders of magnitude compared to the current state-of-the-art solver. In numbers, this amounts to around 300K triangulations per second on the PC and 30K triangulations per second on the Raspberry Pi.
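For reference, the sub-optimal linear least-squares triangulation mentioned as a baseline above can be written in a few lines. This is a standard DLT-style sketch for an arbitrary number of views, not the iterative three-view solver proposed in the paper.

```python
import numpy as np

def triangulate_linear(projections, points_2d):
    """Linear least-squares (DLT) triangulation of one 3D point.

    projections : list of 3x4 camera matrices P_i
    points_2d   : list of (x, y) observations, one per camera
    Each observation contributes two rows, x*P[2] - P[0] and y*P[2] - P[1].
    """
    A = []
    for P, (x, y) in zip(projections, points_2d):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    A = np.asarray(A)
    # Homogeneous solution: right singular vector of the smallest singular value.
    X = np.linalg.svd(A)[2][-1]
    return X[:3] / X[3]
```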
@inproceedings{diva2:756974,
author = {Hedborg, Johan and Robinson, Andreas and Felsberg, Michael},
title = {{Robust Three-View Triangulation Done Fast}},
booktitle = {Proceedings: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2014},
year = {2014},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops},
pages = {152--157},
publisher = {IEEE},
}
While autonomous driving systems accumulate more and more sensors as well as highly specialized visual features and engineered solutions, the human visual system provides evidence that visual input and simple low-level image features are sufficient for successful driving. In this paper we propose extensions (non-linear update and coherence weighting) to one of the simplest biologically inspired learning schemes, Hebbian learning. We show that this is sufficient for online learning of visual autonomous driving, where the system learns to directly map low-level image features to control signals. After the initial training period, the system seamlessly continues autonomously. This extended Hebbian algorithm, qHebb, has constant bounds on time and memory complexity for training and evaluation, independent of the number of training samples presented to the system. Further, the proposed algorithm compares favorably to state-of-the-art engineered batch learning algorithms.
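The plain Hebbian associator that qHebb builds on can be sketched as an outer-product update from feature vectors to control signals. This minimal baseline sketch does not reproduce the non-linear update or the coherence weighting that the paper contributes.

```python
import numpy as np

class HebbianMapping:
    """Minimal linear Hebbian associator from image features to control signals."""

    def __init__(self, n_features, n_controls, learning_rate=0.01):
        self.W = np.zeros((n_controls, n_features))
        self.lr = learning_rate

    def train(self, features, control):
        # Plain Hebbian outer-product update: strengthen co-active units.
        self.W += self.lr * np.outer(control, features)

    def predict(self, features):
        return self.W @ features
```

Note that both `train` and `predict` cost a fixed amount of time and memory per call, regardless of how many samples have been seen, which is the complexity property the abstract refers to.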
@inproceedings{diva2:750039,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Biologically Inspired Online Learning of Visual Autonomous Driving}},
booktitle = {Proceedings British Machine Vision Conference 2014},
year = {2014},
pages = {137--156},
publisher = {BMVA Press},
}
We propose an algorithm that can capture sharp, low-noise images in low-light conditions on a hand-held smartphone. We make use of the recent ability to acquire bursts of high resolution images on high-end models such as the iPhone5s. Frames are aligned, or stacked, using rolling shutter correction, based on motion estimated from the built-in gyro sensors and image feature tracking. After stacking, the images may be combined, using e.g. averaging to produce a sharp, low-noise photo. We have tested the algorithm on a variety of different scenes, using several different smartphones. We compare our method to denoising, direct stacking, as well as a global-shutter based stacking, with favourable results.
@inproceedings{diva2:729193,
author = {Ringaby, Erik and Forss\'{e}n, Per-Erik},
title = {{A Virtual Tripod for Hand-held Video Stacking on Smartphones}},
booktitle = {2014 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL PHOTOGRAPHY (ICCP)},
year = {2014},
series = {IEEE International Conference on Computational Photography},
publisher = {IEEE},
}
Visual tracking is a challenging problem in computer vision. Most state-of-the-art visual trackers either rely on luminance information or use simple color representations for image description. Contrary to visual tracking, for object recognition and detection, sophisticated color features, when combined with luminance, have been shown to provide excellent performance. Due to the complexity of the tracking problem, the desired color feature should be computationally efficient, and possess a certain amount of photometric invariance while maintaining high discriminative power.
This paper investigates the contribution of color in a tracking-by-detection framework. Our results suggest that color attributes provide superior performance for visual tracking. We further propose an adaptive low-dimensional variant of color attributes. Both quantitative and attribute-based evaluations are performed on 41 challenging benchmark color sequences. The proposed approach improves the baseline intensity-based tracker by 24% in median distance precision. Furthermore, we show that our approach outperforms state-of-the-art tracking methods while running at more than 100 frames per second.
@inproceedings{diva2:711538,
author = {Danelljan, Martin and Shahbaz Khan, Fahad and Felsberg, Michael and van de Weijer, Joost},
title = {{Adaptive Color Attributes for Real-Time Visual Tracking}},
booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014},
year = {2014},
series = {IEEE Conference on Computer Vision and Pattern Recognition. Proceedings},
pages = {1090--1097},
publisher = {IEEE Computer Society},
}
In this work we propose a novel non-linear diffusion filtering approach for images based on their channel representation. To derive the diffusion update scheme we formulate a novel energy functional using a soft-histogram representation of image pixel neighborhoods obtained from the channel encoding. The resulting Euler-Lagrange equation yields a non-linear robust diffusion scheme with additional weighting terms stemming from the channel representation which steer the diffusion process. We apply this novel energy formulation to image reconstruction problems, showing good performance in the presence of mixtures of Gaussian and impulse-like noise, e.g. missing data. In denoising experiments of common scalar-valued images our approach performs competitive compared to other diffusion schemes as well as state-of-the-art denoising methods for the considered noise types.
@inproceedings{diva2:690190,
author = {Heinemann, Christian and Åström, Freddie and Baravdish, George and Krajsek, Kai and Felsberg, Michael and Scharr, Hanno},
title = {{Using Channel Representations in Regularization Terms:
A Case Study on Image Diffusion}},
booktitle = {Proceedings of the 9th International Conference on Computer Vision Theory and Applications},
year = {2014},
pages = {48--55},
publisher = {SciTePress},
}
Visual tracking has attracted significant attention in the last few decades. The recent surge in the number of publications on tracking-related problems has made it almost impossible to follow the developments in the field. One of the reasons is that there is a lack of commonly accepted annotated datasets and standardized evaluation protocols that would allow objective comparison of different tracking methods. To address this issue, the Visual Object Tracking (VOT) workshop was organized in conjunction with ICCV2013. Researchers from academia as well as industry were invited to participate in the first VOT2013 challenge, which aimed at single-object visual trackers that do not apply pre-learned models of object appearance (model-free). Presented here is the VOT2013 benchmark dataset for evaluation of single-object visual trackers as well as the results obtained by the trackers competing in the challenge. In contrast to related attempts in tracker benchmarking, the dataset is labeled per-frame by visual attributes that indicate occlusion, illumination change, motion change, size change and camera motion, offering a more systematic comparison of the trackers. Furthermore, we have designed an automated system for performing and evaluating the experiments. We present the evaluation protocol of the VOT2013 challenge and the results of a comparison of 27 trackers on the benchmark dataset. The dataset, the evaluation tools and the tracker rankings are publicly available from the challenge website.
@inproceedings{diva2:1082694,
author = {Kristan, Matej and Pflugfelder, Roman and Leonardis, Ales and Matas, Jiri and Porikli, Fatih and Cehovin, Luka and Nebehay, Georg and Fernandez, Gustavo and Vojir, Tomas and Gatt, Adam and Khajenezhad, Ahmad and Salahledin, Ahmed and Soltani-Farani, Ali and Zarezade, Ali and Petrosino, Alfredo and Milton, Anthony and Bozorgtabar, Behzad and Li, Bo and Seng Chan, Chee and Heng, CherKeng and Ward, Dale and Kearney, David and Monekosso, Dorothy and Can Karaimer, Hakki and Rabiee, Hamid R. and Zhu, Jianke and Gao, Jin and Xiao, Jingjing and Zhang, Junge and Xing, Junliang and Huang, Kaiqi and Lebeda, Karel and Cao, Lijun and Edoardo Maresca, Mario and Kuan Lim, Mei and ELHelw, Mohamed and Felsberg, Michael and Remagnino, Paolo and Bowden, Richard and Goecke, Roland and Stolkin, Rustam and YueYing Lim, Samantha and Maher, Sara and Poullot, Sebastien and Wong, Sebastien and Satoh, Shinichi and Chen, Weihua and Hu, Weiming and Zhang, Xiaoqin and Li, Yang and Niu, ZhiHeng},
title = {{The Visual Object Tracking VOT2013 challenge results}},
booktitle = {2013 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW)},
year = {2013},
pages = {98--111},
publisher = {IEEE},
}
@inproceedings{diva2:787560,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Integrating Learning and Optimization for Active Vision Inverse Kinematics}},
booktitle = {Proceedings of SSBA, Swedish Symposium on Image Analysis, 2013},
year = {2013},
}
Color description is a challenging task because of large variations in RGB values which occur due to scene accidental events, such as shadows, shading, specularities, illuminant color changes, and changes in viewing geometry. Traditionally, this challenge has been addressed by capturing the variations in physics-based models, and deriving invariants for the undesired variations. The drawback of this approach is that sets of distinguishable colors in the original color space are mapped to the same value in the photometric invariant space. This results in a drop of discriminative power of the color description. In this paper we take an information theoretic approach to color description. We cluster color values together based on their discriminative power in a classification problem. The clustering has the explicit objective to minimize the drop of mutual information of the final representation. We show that such a color description automatically learns a certain degree of photometric invariance. We also show that a universal color representation, which is based on other data sets than the one at hand, can obtain competing performance. Experiments show that the proposed descriptor outperforms existing photometric invariants. Furthermore, we show that combined with shape description these color descriptors obtain excellent results on four challenging datasets, namely, PASCAL VOC 2007, Flowers-102, Stanford dogs-120 and Birds-200.
@inproceedings{diva2:707470,
author = {Khan, Rahat and Van de Weijer, Joost and Khan, Fahad Shahbaz and Muselet, Damien and Ducottet, Christophe and Barat, Cecile},
title = {{Discriminative Color Descriptors}},
booktitle = {Computer Vision and Pattern Recognition (CVPR), 2013},
year = {2013},
series = {IEEE Conference on Computer Vision and Pattern Recognition. Proceedings},
pages = {2866--2873},
publisher = {IEEE Computer Society},
}
State-of-the-art texture descriptors typically operate on greyscale images while ignoring color information. A common way to obtain a joint color-texture representation is to combine the two visual cues at the pixel level. However, such an approach provides sub-optimal results for the texture categorisation task.
In this paper we investigate how to optimally exploit color information for texture recognition. We evaluate a variety of color descriptors, popular in image classification, for texture categorisation. In addition, we analyze different fusion approaches to combine color and texture cues. Experiments are conducted on challenging scene and 10-class texture datasets. Our experiments clearly suggest that in all cases color names provide the best performance, and that late fusion is the best strategy to combine color and texture. Selecting the best color descriptor with the optimal fusion strategy provides a gain of 5% to 8% compared to texture alone on the scene and texture datasets.
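Late fusion, as used above, means training one classifier per cue and combining their outputs, rather than concatenating the descriptors before training (early fusion). A minimal scikit-learn sketch of this strategy follows; the descriptor matrices and the equal weighting are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

def train_late_fusion(X_color, X_texture, y):
    """Train one classifier per cue; fusion happens on their scores, not their features."""
    clf_color = SVC(kernel="rbf", probability=True).fit(X_color, y)
    clf_texture = SVC(kernel="rbf", probability=True).fit(X_texture, y)
    return clf_color, clf_texture

def predict_late_fusion(clf_color, clf_texture, X_color, X_texture, w=0.5):
    """Weighted sum of per-cue class probabilities; the argmax gives the fused label."""
    p = (w * clf_color.predict_proba(X_color)
         + (1 - w) * clf_texture.predict_proba(X_texture))
    return clf_color.classes_[np.argmax(p, axis=1)]
```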
@inproceedings{diva2:707460,
author = {Khan, Fahad Shahbaz and Van de Weijer, Joost and Ali, Sadiq and Felsberg, Michael},
title = {{Evaluating the Impact of Color on Texture Recognition}},
booktitle = {Computer Analysis of Images and Patterns},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {8047},
pages = {154--162},
publisher = {Springer Berlin/Heidelberg},
}
Visual tracking of objects under varying lighting conditions and changes of the object appearance, such as articulation and change of aspect, is a challenging problem. Due to its robustness and speed, distribution field tracking is among the state-of-the-art approaches for tracking objects with constant size in grayscale sequences. According to the theory of averaged shifted histograms, distribution fields are an approximation of kernel density estimates. Channel representations are another, more efficient approximation, and are used in the present paper to derive an enhanced computational scheme for tracking. This enhanced distribution field tracking method outperforms several state-of-the-art methods on the VOT2013 challenge, which evaluates accuracy, robustness, and speed.
@inproceedings{diva2:662687,
author = {Felsberg, Michael},
title = {{Enhanced Distribution Field Tracking using Channel Representations}},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2013},
year = {2013},
pages = {121--128},
publisher = {IEEE conference proceedings},
}
For navigation of unmanned aerial vehicles (UAVs), attitude estimation is essential. We present a method for attitude estimation (pitch and roll angle) from aerial fisheye images through horizon detection. The method is based on edge detection and a probabilistic Hough voting scheme. In a flight scenario, there is often some prior knowledge of the vehicle altitude and attitude. We exploit this prior to make the attitude estimation more robust by letting the edge pixel votes be weighted based on the probability distributions for the altitude and pitch and roll angles. The method does not require any sky/ground segmentation as most horizon detection methods do. Our method has been evaluated on aerial fisheye images from the internet. The horizon is robustly detected in all tested images. The deviation in the attitude estimate between our automated horizon detection and a manual detection is less than 1 degree.
@inproceedings{diva2:651774,
author = {Grelsson, Bertil and Felsberg, Michael},
title = {{Probabilistic Hough Voting for Attitude Estimation from Aerial Fisheye Images}},
booktitle = {Image Analysis},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7944},
pages = {478--488},
publisher = {Springer Berlin/Heidelberg},
}
The development of vehicles that perceive their environment, in particular those using computer vision, indispensably requires large databases of sensor recordings obtained from real cars driven in realistic traffic situations. These datasets should be time-stamped to enable synchronization of sensor data from different sources. Furthermore, full surround environment perception requires high frame rates of synchronized omnidirectional video data to prevent information loss at any speed.
This paper describes an experimental setup and software environment for recording such synchronized multi-sensor data streams and storing them in a new open source format. The dataset consists of sequences recorded in various environments from a car equipped with an omnidirectional multi-camera, height sensors, an IMU, a velocity sensor, and a GPS. The software environment for reading these data sets will be provided to the public, together with a collection of long multi-sensor and multi-camera data streams stored in the developed format.
@inproceedings{diva2:623885,
author = {Koschorrek, Philipp and Piccini, Tommaso and Öberg, Per and Felsberg, Michael and Nielsen, Lars and Mester, Rudolf},
title = {{A multi-sensor traffic scene dataset with omnidirectional video}},
booktitle = {2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW)},
year = {2013},
series = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops},
pages = {727--734},
publisher = {IEEE conference proceedings},
}
We analyze the consequences of instabilities and fluctuations, such as camera shaking and illumination/exposure changes, on typical surveillance video material and devise a systematic way to compensate for these changes as much as possible. The phase correlation method plays a decisive role in the proposed scheme, since it is inherently insensitive to gain and offset changes, as well as to different linear degradations (due to time-variant motion blur) in subsequent images. We show that the listed variations can be compensated effectively, and the image data can be equilibrated significantly before a temporal change detection and/or a background-based detection is performed. We verify the usefulness of the method by comparative tests with and without stabilization, using the changedetection.net benchmark and several state-of-the-art detection methods.
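Phase correlation, the key component mentioned above, estimates the dominant translation between two frames from the normalized cross-power spectrum; its insensitivity to gain and offset comes from keeping only the phase. A minimal numpy sketch follows (integer-pixel shifts only, with the sign convention depending on which frame is treated as the reference).

```python
import numpy as np

def phase_correlation_shift(frame_a, frame_b, eps=1e-9):
    """Estimate the dominant integer-pixel translation between two greyscale frames."""
    Fa, Fb = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    cross_power = Fa * np.conj(Fb)
    cross_power /= np.abs(cross_power) + eps      # keep only the phase information
    corr = np.real(np.fft.ifft2(cross_power))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the image size to negative displacements.
    if dy > frame_a.shape[0] // 2:
        dy -= frame_a.shape[0]
    if dx > frame_a.shape[1] // 2:
        dx -= frame_a.shape[1]
    return dy, dx
```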
@inproceedings{diva2:623605,
author = {Eisenbach, Jens and Mertz, Matthias and Conrad, Christian and Mester, Rudolf},
title = {{Reducing Camera Vibrations and Photometric Changes in Surveillance Video}},
booktitle = {10th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), August 27-30, Krakow, Poland},
year = {2013},
pages = {69--74},
publisher = {IEEE},
}
We propose and evaluate a versatile scheme for image pre-segmentation that generates a partition of the image into a selectable number of patches ('superpixels'), under the constraint of obtaining maximum homogeneity of the 'texture' inside of each patch, and maximum accordance of the contours with both the image content as well as a Gibbs-Markov random field model. In contrast to current state-of-the-art approaches to superpixel segmentation, 'homogeneity' does not limit itself to smooth region-internal signals and high feature value similarity between neighboring pixels, but is applicable also to highly textured scenes. The energy functional that is to be maximized for this purpose has only a very small number of design parameters, depending on the particular statistical model used for the images.
The capability of the resulting partitions to deform according to the image content can be controlled by a single parameter. We show by means of an extensive comparative experimental evaluation that the compactness-controlled contour-relaxed superpixels method outperforms the state-of-the-art superpixel algorithms with respect to boundary recall and undersegmentation error, while being faster or on a par with respect to runtime.
@inproceedings{diva2:623602,
author = {Conrad, Christian and Mertz, Matthias and Mester, Rudolf},
title = {{Contour-relaxed Superpixels}},
booktitle = {EMMCVPR 2013. 9th International Conference Energy Minimization Methods in Computer Vision and Pattern Recognition, August 19-21, Lund, Sweden},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {8081},
pages = {280--293},
publisher = {Springer Berlin/Heidelberg},
}
In this work we present an approach to automatically learn pixel correspondences between pairs of cameras. We build on the method of Temporal Coincidence Analysis (TCA) and extend it from the pure temporal (i.e. single-pixel) to the spatiotemporal domain. Our approach is based on learning a statistical model for local spatiotemporal image patches, determining rare and expressive events from this model, and matching these events across multiple views. Accumulating multi-image coincidences of such events over time makes it possible to learn the desired geometric and photometric relations. The presented method also works for strongly different viewpoints and camera settings, including substantial rotation and translation. The only assumption made is that the relative orientation of pairs of cameras may be arbitrary, but fixed, and that the observed scene shows visual activity. We show that the proposed method outperforms the single-pixel approach to TCA both in terms of learning speed and accuracy.
@inproceedings{diva2:615201,
author = {Mester, Rudolf and Conrad, Christian},
title = {{Learning Multi-View Correspondences via Subspace-Based Temporal Coincidences}},
booktitle = {Proceeding Scandinavian Conference on Image Analysis, 2013},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7944},
pages = {456--467},
publisher = {Springer Berlin/Heidelberg},
}
We introduce a method to combine the color channels of an image into a scalar-valued image. Linear combinations of the RGB channels are constructed using the Fisher-Trace-Information (FTI), defined as the trace of the Fisher information matrix of the Weibull distribution, as a cost function. The FTI characterizes the local geometry of the Weibull manifold independent of the parametrization of the distribution. We show that minimizing the FTI leads to contrast-enhanced images, suitable for segmentation processes. The Riemann structure of the manifold of Weibull distributions is used to design optimization methods for finding optimal RGB weight vectors. Using a threshold procedure we find good solutions even for images with limited content variation. Experiments show how the method adapts to images with widely varying visual content. Using these image-dependent de-colorizations, one can obtain substantially improved segmentation results compared to a mapping with pre-defined coefficients.
@inproceedings{diva2:607078,
author = {Lenz, Reiner and Zografos, Vasileios},
title = {{Fisher Information and the Combination of RGB channels}},
booktitle = {4th International Workshop, CCIW 2013, Chiba, Japan, March 3-5, 2013. Proceedings},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7786},
pages = {250--264},
publisher = {Springer Berlin/Heidelberg},
}
In this work we derive a novel density driven diffusion scheme for image enhancement. Our approach, called D3, is a semi-local method that uses an initial structure-preserving oversegmentation step of the input image. Because of this, each segment will approximately conform to a homogeneous region in the image, allowing us to easily estimate parameters of the underlying stochastic process thus achieving adaptive non-linear filtering. Our method is capable of producing competitive results when compared to state-of-the-art methods such as non-local means, BM3D and tensor driven diffusion on both color and grayscale images.
@inproceedings{diva2:611186,
author = {Åström, Freddie and Zografos, Vasileios and Felsberg, Michael},
title = {{Density Driven Diffusion}},
booktitle = {18th Scandinavian Conferences on Image Analysis, 2013},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7944},
pages = {718--730},
}
The assessment of image denoising results depends on the respective application area, i.e. image compression, still-image acquisition, and medical images require entirely different behavior of the applied denoising method. In this paper we propose a novel, nonlinear diffusion scheme that is derived from a linear diffusion process in a value space determined by the application. We show that application-driven linear diffusion in the transformed space compares favorably with existing nonlinear diffusion techniques.
@inproceedings{diva2:608779,
author = {Åström, Freddie and Felsberg, Michael and Baravdish, George and Lundström, Claes},
title = {{Targeted Iterative Filtering}},
booktitle = {Fourth International Conference on Scale Space and Variational Methods in Computer Vision (SSVM 2013), 2-6 June 2013, Schloss Seggau, Graz region, Austria},
year = {2013},
series = {Lecture Notes in Computer Science},
volume = {7893},
pages = {1--11},
publisher = {Springer Berlin/Heidelberg},
}
Robust estimation of the relative pose between two cameras is a fundamental part of Structure and Motion methods. For calibrated cameras, the five-point method together with a robust estimator such as RANSAC gives the best result in most cases. The current state-of-the-art method for solving the relative pose problem from five points is due to Nistér [9], because it is faster than other methods and, in the RANSAC scheme, one can improve precision by increasing the number of iterations. In this paper, we propose a new iterative method based on Powell's Dog Leg algorithm. The new method has the same precision and is approximately twice as fast as Nistér's algorithm. The proposed method is easily extended to more than five points while retaining an efficient error metric, which also makes it well suited as a refinement step. The proposed algorithm is systematically evaluated on three types of datasets with known ground truth.
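The RANSAC scheme around a minimal five-point solver has a simple generic structure, sketched below. Both helpers, `solve_relative_pose` (the minimal solver, returning one or more essential-matrix hypotheses) and `sampson_error` (a per-correspondence residual), are hypothetical placeholders; only the hypothesise-and-verify loop is shown, not the solver contributed by the paper.

```python
import numpy as np

def ransac_relative_pose(pts1, pts2, solve_relative_pose, sampson_error,
                         n_iters=500, threshold=1e-3, rng=None):
    """Generic RANSAC loop around a minimal (five-point) relative pose solver.

    pts1, pts2          : Nx2 arrays of calibrated (normalized) image points
    solve_relative_pose : hypothetical minimal solver, 5 correspondences -> essential matrices
    sampson_error       : hypothetical per-point error given an essential matrix
    """
    rng = np.random.default_rng() if rng is None else rng
    best_E, best_inliers = None, np.zeros(len(pts1), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(pts1), size=5, replace=False)
        for E in solve_relative_pose(pts1[sample], pts2[sample]):  # may yield several solutions
            inliers = sampson_error(E, pts1, pts2) < threshold
            if inliers.sum() > best_inliers.sum():
                best_E, best_inliers = E, inliers
    return best_E, best_inliers
```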
@inproceedings{diva2:612068,
author = {Hedborg, Johan and Felsberg, Michael},
title = {{Fast Iterative Five point Relative Pose Estimation}},
booktitle = {IEEE Workshop on Robot Vision (WoRV 2013), January 15-17, 2013, Clearwater, FL, USA},
year = {2013},
pages = {60--67},
publisher = {IEEE conference proceedings},
}
We present a novel method for clustering data drawn from a union of arbitrary dimensional subspaces, called Discriminative Subspace Clustering (DiSC). DiSC solves the subspace clustering problem by using a quadratic classifier trained from unlabeled data (clustering by classification). We generate labels by exploiting the locality of points from the same subspace and a basic affinity criterion. A number of classifiers are then diversely trained from different partitions of the data, and their results are combined together in an ensemble, in order to obtain the final clustering result. We have tested our method with 4 challenging datasets and compared against 8 state-of-the-art methods from literature. Our results show that DiSC is a very strong performer in both accuracy and robustness, and also of low computational complexity.
@inproceedings{diva2:610663,
author = {Zografos, Vasileios and Ellis, Liam and Mester, Rudolf},
title = {{Discriminative Subspace Clustering}},
booktitle = {26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), June 23-28, 2013, Portland, Oregon, USA},
year = {2013},
series = {2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)},
}
A method for online global pose estimation of aerial images by alignment with a georeferenced 3D model is presented. Motion stereo is used to reconstruct a dense local height patch from an image pair. The global pose is inferred from the 3D transform between the local height patch and the model. For efficiency, the sought 3D similarity transform is found by least-squares minimizations of three 2D subproblems. The method does not require any landmarks or reference points in the 3D model, but an approximate initialization of the global pose, in our case provided by onboard navigation sensors, is assumed. Real aerial images from helicopter and aircraft flights are used to evaluate the method. The results show that the accuracy of the position and orientation estimates is significantly improved compared to the initialization, and our method is more robust than competing methods on similar datasets. The proposed matching error, computed between the transformed patch and the map, clearly indicates whether a reliable pose estimate has been obtained.
@inproceedings{diva2:607988,
author = {Grelsson, Bertil and Felsberg, Michael and Isaksson, Folke},
title = {{Efficient 7D Aerial Pose Estimation}},
booktitle = {2013 IEEE Workshop on Robot Vision (WORV)},
year = {2013},
pages = {88--95},
publisher = {IEEE},
}
An online method for rapidly learning the inverse kinematics of a redundant robotic arm is presented, addressing the special requirements of active vision for visual inspection tasks. The system is initialized with a model covering a small area around the starting position, which is then incrementally extended by exploration. The number of motions during this process is minimized by only exploring configurations required for successful completion of the task at hand. The explored area is automatically extended online and on demand. To achieve this, state-of-the-art methods for learning and numerical optimization are combined in a tight implementation where parts of the learned model, the Jacobians, are used during optimization, resulting in significant synergy effects. In a series of standard experiments, we show that the integrated method performs better than using both methods sequentially.
@inproceedings{diva2:606285,
author = {Öfjäll, Kristoffer and Michael, Felsberg},
title = {{Rapid Explorative Direct Inverse Kinematics Learning of Relevant Locations for Active Vision}},
booktitle = {IEEE Workshop on Robot Vision(WORV) 2013},
year = {2013},
pages = {14--19},
publisher = {IEEE conference proceedings},
}
Many RGB-D sensors, e.g. the Microsoft Kinect, use rolling shutter cameras. Such cameras produce geometrically distorted images when the sensor is moving. To mitigate these rolling shutter distortions we propose a method that uses an attached gyroscope to rectify the depth scans. We also present a simple scheme to calibrate the relative pose and time synchronization between the gyro and a rolling shutter RGB-D sensor. We examine the effectiveness of our rectification scheme by coupling it with the Kinect Fusion algorithm. By comparing Kinect Fusion models obtained from raw sensor scans and from rectified scans, we demonstrate improvement for three classes of sensor motion: panning motions cause slant distortions, tilt motions cause vertically elongated or compressed objects, and wobble causes a loss of detail, compared to the reconstructions using rectified depth scans. As our method relies on gyroscope readings, the amount of computation required is negligible compared to the cost of running Kinect Fusion.
@inproceedings{diva2:603474,
author = {Ovr\'{e}n, Hannes and Forss\'{e}n, Per-Erik and Törnqvist, David},
title = {{Why Would I Want a Gyroscope on my RGB-D Sensor?}},
booktitle = {Proceedings of 2013 IEEE Workshop on Robot Vision (WORV)},
year = {2013},
pages = {68--75},
publisher = {IEEE},
}
This paper presents an autonomous robotic system that incorporates novel Computer Vision, Machine Learning and Data Mining algorithms in order to learn to navigate and discover important visual entities. This is achieved within a Learning from Demonstration (LfD) framework, where policies are derived from example state-to-action mappings. For autonomous navigation, a mapping is learnt from holistic image features (GIST) onto control parameters using Random Forest regression. Additionally, visual entities (road signs, e.g. the STOP sign) that are strongly associated with autonomously discovered modes of action (e.g. stopping behaviour) are discovered through a novel Percept-Action Mining methodology. The resulting sign detector is learnt without any supervision (no image labeling or bounding box annotations are used). The complete system is demonstrated on a fully autonomous robotic platform, featuring a single camera mounted on a standard remote control car. The robot carries a PC laptop that performs all the processing on board and in real-time.
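The navigation component above is, at its core, a regression from a holistic image descriptor to control parameters learned from demonstration data. A minimal scikit-learn sketch follows; the `gist_features` function and the two-dimensional control vector are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_navigation_policy(images, controls, gist_features):
    """Learn a state-to-action mapping from demonstrations (Learning from Demonstration).

    images        : list of demonstration camera frames
    controls      : array of shape (n_frames, 2), e.g. (steering, throttle) per frame
    gist_features : hypothetical function, image -> 1D holistic descriptor
    """
    X = np.stack([gist_features(img) for img in images])
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, controls)
    return model

# At run time, the policy maps the current camera frame directly to control outputs:
#   steering, throttle = model.predict(gist_features(frame)[None, :])[0]
```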
@inproceedings{diva2:575736,
author = {Ellis, Liam and Pugeault, Nicolas and Öfjäll, Kristoffer and Hedborg, Johan and Bowden, Richard and Felsberg, Michael},
title = {{Autonomous Navigation and Sign Detector Learning}},
booktitle = {IEEE Workshop on Robot Vision(WORV) 2013},
year = {2013},
pages = {144--151},
publisher = {IEEE},
}
Face tracking is an extensively studied field. Nevertheless, it is still a challenge to make a robust and efficient face tracker, especially on mobile devices. This extended abstract briefly describes our implementation of a high-performance multi-platform face and facial feature tracking system. The main characteristics of our approach are that the tracker is fully automatic and works with the majority of faces without any manual initialization. It is robust, resistant to rapid changes in pose and facial expressions, does not suffer from drifting, and has a modest computational cost. The tracker runs in real-time on mobile devices.
@inproceedings{diva2:845459,
author = {Marku\v{s}, Nenad and Frljak, Miroslav and Pandži\'{c}, Igor and Ahlberg, Jörgen and Forchheimer, Robert},
title = {{High-performance face tracking}},
booktitle = {ACM 3rd International Symposium on Facial Analysis and Animation},
year = {2012},
}
The labyrinth game is a simple yet challenging platform, not only for humans but also for control algorithms and systems. The game is easy to understand but still very hard to master. From a system point of view, the ball behavior is in general easy to model, but close to the obstacles there are severe non-linearities. Additionally, the far-from-flat surface on which the ball rolls provides for changing dynamics depending on the ball position.
The general dynamics of the system can easily be handled by traditional automatic control methods. Taking the obstacles and uneven surface into account would, however, require very detailed models of the system. A simple deterministic control algorithm is therefore combined with a learning control method. The simple control method provides initial training data. As the learning method is trained, the system can learn from the results of its own actions, and the performance improves well beyond that of the initial controller.
A vision system and image analysis are used to estimate the ball position, while a combination of a PID controller and a learning controller based on LWPR is used to learn to steer the ball through the maze.
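The deterministic controller used to bootstrap the learning can be as simple as a textbook PID loop per tilt axis. The sketch below is a generic PID implementation with illustrative gains; it is not the paper's exact controller or tuning.

```python
class PID:
    """Textbook PID controller for one tilt axis of the labyrinth platform."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement, dt):
        error = setpoint - measurement            # e.g. desired minus estimated ball position
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One controller per tilt axis; the gains below are illustrative only.
pid_x, pid_y = PID(0.8, 0.05, 0.2), PID(0.8, 0.05, 0.2)
```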
@inproceedings{diva2:750037,
author = {Öfjäll, Kristoffer and Felsberg, Michael},
title = {{Combining Vision, Machine Learning and Automatic Control to Play the Labyrinth Game}},
booktitle = {Proceedings of SSBA, Swedish Symposium on Image Analysis, 2012},
year = {2012},
}
In recent years, advanced video sensors have become common in driver assistance, coping with the highly dynamic lighting conditions by nonlinear exposure adjustments. However, many computer vision algorithms are still highly sensitive to the resulting sudden brightness changes. We present a method that is able to estimate the relative intensity transfer function (RITF) between images in a sequence, even for moving cameras. The corresponding compensation of the input images can improve the performance of further vision tasks significantly, here demonstrated by results from optical flow. Our method identifies corresponding intensity values from areas in the images where no apparent motion is present. The RITF is then estimated from that data and regularized based on its curvature. Finally, built-in tests reliably flag image pairs with adverse conditions where no compensation could be performed.
@inproceedings{diva2:665610,
author = {Dederscheck, David and Muller, T. and Mester, Rudolf},
title = {{Illumination invariance for driving scene optical flow using comparagram preselection}},
booktitle = {IEEE Intelligent Vehicles Symposium (IV), Proceedings},
year = {2012},
series = {IEEE Intelligent Vehicles Symposium, Proceedings},
volume = {4},
pages = {742--747},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
Complementary information, when combined in the right way, is capable of improving clustering and segmentation problems. In this paper, we show how it is possible to enhance motion segmentation accuracy with a very simple and inexpensive combination of complementary information, which comes from the column and row spaces of the same measurement matrix. We test our approach on the Hopkins155 dataset where it outperforms all other state-of-the-art methods.
@inproceedings{diva2:572612,
author = {Zografos, Vasileios},
title = {{Enhancing motion segmentation by combination of complementary affinities}},
booktitle = {Proceedings of the 21st Internationa Conference on Pattern Recognition},
year = {2012},
pages = {2198--2201},
}
Combining the channels of a multi-band image with the help of a pixelwise weighted sum is one of the basic operations in color and multispectral image processing. A typical example is the conversion of RGB images to intensity images. Usually the weights are given by some standard values or chosen heuristically. This takes into account neither the statistical nature of the image source nor the intended further processing of the scalar image. In this paper we present a framework in which we specify the statistical properties of the input data with the help of a representative collection of image patches. On the output side we specify the intended processing of the scalar image with the help of a filter kernel with zero-mean filter coefficients. Given the image patches and the filter kernel, we use the Fisher information of the manifold of two-parameter Weibull distributions to introduce the trace of the Fisher information matrix as a cost function on the space of weight vectors of unit length. We illustrate the properties of the method with the help of a database of scanned leaves and some color images from the internet. For the green leaves we find that the result of the mapping is similar to standard mappings like Matlab's RGB2Gray weights. We then change the color of the leaf using a global shift in the HSV representation of the original image and show how the proposed mapping adapts to this color change. This is also confirmed with other natural images, where the new mapping reveals much more subtle details in the processed image. In the last experiment we show that the mapping emphasizes visually salient points in the image, whereas the standard mapping only emphasizes global intensity changes. The proposed approach to RGB filter design thus provides a new methodology based only on the properties of the image statistics and the intended post-processing. It adapts to color changes of the input images and, due to its foundation in the statistics of extreme-value distributions, it is suitable for detecting salient regions in an image.
@inproceedings{diva2:529514,
author = {Lenz, Reiner and Zografos, Vasileios},
title = {{RGB Filter design using the properties of the weibull manifold}},
booktitle = {CGIV 2012 Sixth European Conference on Colour in Graphics, Imaging, and Vision},
year = {2012},
pages = {200--205},
address = {Springfield, VA},
}
State-of-the-art object detectors typically use shape information as a low-level feature representation to capture the local structure of an object. This paper shows that early fusion of shape and color, as is popular in image classification, leads to a significant drop in performance for object detection. Moreover, such approaches also yield suboptimal results for object categories with varying importance of color and shape. In this paper we propose the use of color attributes as an explicit color representation for object detection. Color attributes are compact, computationally efficient, and when combined with traditional shape features provide state-of-the-art results for object detection. Our method is tested on the PASCAL VOC 2007 and 2009 datasets and results clearly show that our method improves over state-of-the-art techniques despite its simplicity. We also introduce a new dataset consisting of cartoon character images in which color plays a pivotal role. On this dataset, our approach yields a significant gain of 14% in mean AP over conventional state-of-the-art methods.
@inproceedings{diva2:600948,
author = {Khan, Fahad Shahbaz and Anwer, Rao Muhammad and van de Weijer, Joost and Bagdanov, Andrew D. and Vanrell, Maria and Lopez, Antonio M.},
title = {{Color Attributes for Object Detection}},
booktitle = {Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2012},
year = {2012},
pages = {3306--3313},
publisher = {IEEE},
}
This work addresses the problem of fast, online segmentation of moving objects in video. We pose this as a discriminative online semi-supervised appearance learning task, where supervising labels are autonomously generated by a motion segmentation algorithm. The computational complexity of the approach is significantly reduced by performing learning and classification on oversegmented image regions (superpixels), rather than per pixel. In addition, we further exploit the sparse trajectories from the motion segmentation to obtain a simple model that encodes the spatial properties and location of objects at each frame. Fusing these complementary cues produces good object segmentations at very low computational cost. In contrast to previous work, the proposed approach (1) performs segmentation on-the-fly (allowing for applications where data arrives sequentially), (2) has no prior model of object types or 'objectness', and (3) operates at significantly reduced computational cost. The approach and its ability to learn, disambiguate and segment the moving objects in the scene is evaluated on a number of benchmark video sequences.
@inproceedings{diva2:575721,
author = {Ellis, Liam and Zografos, Vasileios},
title = {{Online Learning for Fast Segmentation of Moving Objects}},
booktitle = {ACCV 2012},
year = {2012},
series = {Lecture Notes in Computer Science},
volume = {7725},
pages = {52--65},
publisher = {Springer Berlin/Heidelberg},
}
This paper describes a method for generation of dense stereo ground-truth using a consumer depth sensor such as the Microsoft Kinect. Such ground-truth allows adaptation of stereo algorithms to a specific setting. The method uses a novel residual weighting based on error propagation from image plane measurements to 3D. We use this ground-truth in wide-angle stereo learning by automatically tuning a novel extension of the best-first-propagation (BFP) dense correspondence algorithm. We extend BFP by adding a coarse-to-fine scheme, and a structure measure that limits propagation along linear structures and flat areas. The tuned correspondence algorithm is evaluated in terms of accuracy, robustness, and ability to generalise. Both the tuning cost function, and the evaluation are designed to balance the accuracy-robustness trade-off inherent in patch-based methods such as BFP.
@inproceedings{diva2:551483,
author = {Wallenberg, Marcus and Forss\'{e}n, Per-Erik},
title = {{Teaching Stereo Perception to YOUR Robot}},
booktitle = {Proceedings of 23rd British Machine Vision Conference},
year = {2012},
pages = {1--12},
publisher = {University of Surrey, UK},
}
We investigate the case when a partial differential equation (PDE) can be considered as an Euler-Lagrange (E-L) equation of an energy functional consisting of a data term and a smoothness term. We show the necessary conditions for a PDE to be the E-L equation of a corresponding functional. This energy functional is applied to a color image denoising problem, and it is shown that the method compares favorably to current state-of-the-art color image denoising techniques.
@inproceedings{diva2:543914,
author = {Åström, Freddie and Baravdish, George and Felsberg, Michael},
title = {{On Tensor-Based PDEs and their Corresponding Variational Formulations with Application to Color Image Denoising}},
booktitle = {ECCV 2012: 12th European Conference on Computer Vision, 7-12 October, Firenze, Italy},
year = {2012},
series = {Lecture Notes in Computer Science},
volume = {7574},
pages = {215--228},
publisher = {Springer Berlin/Heidelberg},
}
@inproceedings{diva2:535824,
author = {Magnusson, Maria and Dahlqvist Leinhard, Olof and van Ettinger-Veenstra, Helene and Lundberg, Peter},
title = {{FMRI Using 3D PRESTO-CAN - A Novel Method Based on Golden Angle Hybrid Radial-Cartesian Sampling of K-Space}},
booktitle = {ISMRM, Melbourne, Australia, 5-11 May, 2012},
year = {2012},
}
This paper introduces a bundle adjustment (BA) method that obtains accurate structure and motion from rolling shutter (RS) video sequences: RSBA. When a classical BA algorithm processes a rolling shutter video, the resultant camera trajectory is brittle, and complete failures are not uncommon. We exploit the temporal continuity of the camera motion to define residuals of image point trajectories with respect to the camera trajectory. We compare the camera trajectories from RSBA to those from classical BA, and from classical BA on rectified videos. The comparisons are done on real video sequences from an iPhone 4, with ground truth obtained from a global shutter camera, rigidly mounted to the iPhone 4. Compared to classical BA, the rolling shutter model requires just six extra parameters. It also degrades the sparsity of the system Jacobian slightly, but as we demonstrate, the increase in computation time is moderate. Decisive advantages are that RSBA succeeds in cases where competing methods diverge, and consistently produces more accurate results.
@inproceedings{diva2:517591,
author = {Hedborg, Johan and Forss\'{e}n, Per-Erik and Felsberg, Michael and Ringaby, Erik},
title = {{Rolling Shutter Bundle Adjustment}},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012},
year = {2012},
series = {Computer Vision and Pattern Recognition},
pages = {1434--1441},
publisher = {IEEE Computer Society; 1999},
}
In this paper, we present a model-based video coding method that uses input from colour and depth cameras, such as the Microsoft Kinect. The model-based approach uses a 3D representation of the scene, enabling several other applications besides video playback. Some of these applications are stereoscopic viewing, object insertion for augmented reality, and free viewpoint viewing. The video encoding step uses computer vision to estimate the camera motion. The scene geometry is represented by keyframes, which are encoded as 3D quads using a quadtree, allowing good compression rates. Camera motion in-between keyframes is approximated to be linear. The relative camera positions at keyframes and the scene geometry are then compressed and transmitted to the decoder. Our experiments demonstrate that the model-based approach delivers a high level of detail at competitively low bitrates.
@inproceedings{diva2:525249,
author = {Sandberg, David and Forss\'{e}n, Per-Erik and Ogniewski, Jens},
title = {{Model-Based Video Coding using Colour and Depth Cameras}},
booktitle = {Digital Image Computing},
year = {2011},
pages = {158--163},
publisher = {IEEE},
}
Structured light range sensors, such as the Microsoft Kinect, have recently become popular as perception devices for computer vision and robotic systems. These sensors use CMOS imaging chips with electronic rolling shutters (ERS). When using such a sensor on a moving platform, both the image, and the depth map, will exhibit geometric distortions. We introduce an algorithm that can suppress such distortions, by rectifying the 3D point clouds from the range sensor. This is done by first estimating the time continuous 3D camera trajectory, and then transforming the 3D points to where they would have been, if the camera had been stationary. To ensure that image and range data are synchronous, the camera trajectory is computed from KLT tracks on the structured-light frames, after suppressing the structured-light pattern. We evaluate our rectification, by measuring angles between the visible sides of a cube, before and after rectification. We also measure how much better the 3D point clouds can be aligned after rectification. The obtained improvement is also related to the actual rotational velocity, measured using a MEMS gyroscope.
@inproceedings{diva2:525244,
author = {Ringaby, Erik and Forss\'{e}n, Per-Erik},
title = {{Scan Rectification for Structured Light Range Sensors with Rolling Shutters}},
booktitle = {IEEE International Conference on Computer Vision},
year = {2011},
series = {International Conference on Computer Vision (ICCV)},
pages = {1575--1582},
address = {Barcelona Spain},
}
We present a system that rectifies and stabilizes video sequences on mobile devices with rolling-shutter cameras. The system corrects for rolling-shutter distortions using measurements from accelerometer and gyroscope sensors, and a 3D rotational distortion model. In order to obtain a stabilized video, and at the same time keep most content in view, we propose an adaptive low-pass filter algorithm to obtain the output camera trajectory. The accuracy of the orientation estimates has been evaluated experimentally using ground truth data from a motion capture system. We have conducted a user study, where the output from our system, implemented in iOS, has been compared to that of three other applications, as well as to the uncorrected video. The study shows that users prefer our sensor-based system.
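The central idea of the stabilization above, computing a smoothed output camera trajectory and warping each frame toward it, can be illustrated with a basic first-order low-pass filter. This is a deliberately simplified sketch: it filters per-axis orientation angles directly and uses a fixed smoothing factor, whereas the paper uses an adaptive filter and a full 3D rotational model.

```python
import numpy as np

def smooth_trajectory(orientations, alpha=0.1):
    """First-order low-pass filter of a per-frame orientation sequence.

    orientations : array (n_frames, 3) of small per-axis angles in radians
    alpha        : fixed smoothing factor; the paper adapts this to keep content in view
    Simplification: a proper treatment would filter on the rotation manifold.
    """
    smoothed = np.empty_like(orientations)
    smoothed[0] = orientations[0]
    for t in range(1, len(orientations)):
        smoothed[t] = (1 - alpha) * smoothed[t - 1] + alpha * orientations[t]
    return smoothed

# The stabilizing warp for frame t is then driven by the difference between the
# measured orientation and smoothed[t] (the virtual, stabilized camera).
```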
@inproceedings{diva2:525241,
author = {Hanning, Gustav and Forslöw, Nicklas and Forss\'{e}n, Per-Erik and Ringaby, Erik and Törnqvist, David and Callmer, Jonas},
title = {{Stabilizing Cell Phone Video using Inertial Measurement Sensors}},
booktitle = {The Second IEEE International Workshop on Mobile Vision},
year = {2011},
pages = {1--8},
address = {Barcelona Spain},
}
The majority of consumer quality cameras sold today have CMOS sensors with rolling shutters. In a rolling shutter camera, images are read out row by row, and thus each row is exposed during a different time interval. A rolling-shutter exposure causes geometric image distortions when either the camera or the scene is moving, and this causes state-of-the-art structure and motion algorithms to fail. We demonstrate a novel method for solving the structure and motion problem for rolling-shutter video. The method relies on exploiting the continuity of the camera motion, both between frames, and across a frame. We demonstrate the effectiveness of our method by controlled experiments on real video sequences. We show, both visually and quantitatively, that our method outperforms standard structure and motion, and is more accurate and efficient than a two-step approach, doing image rectification and structure and motion.
@inproceedings{diva2:505440,
author = {Hedborg, Johan and Ringaby, Erik and Forss\'{e}n, Per-Erik and Felsberg, Michael},
title = {{Structure and Motion Estimation from Rolling Shutter Video}},
booktitle = {IEEE International Conference onComputer Vision Workshops (ICCV Workshops), 2011},
year = {2011},
pages = {17--23},
publisher = {IEEE Xplore},
}
Quantitative tissue classification using dual-energy CT has the potential to improve accuracy in radiation therapy dose planning, as it provides more information about the material composition of scanned objects than the currently used methods based on single-energy CT. One problem that hinders successful application of both single- and dual-energy CT is the presence of beam hardening and scatter artifacts in reconstructed data. Current pre- and post-correction methods used for image reconstruction often bias CT numbers and thus limit their applicability for quantitative tissue classification. Here we demonstrate simulation studies with a novel iterative algorithm that decomposes every soft tissue voxel into three base materials: water, protein and adipose. The results demonstrate that beam hardening artifacts can effectively be removed and accurate estimation of the mass fractions of all base materials can be achieved. In the future, the algorithm may be developed further to include segmentation of soft and bone tissue and subsequent bone decomposition, extension from 2-D to 3-D, and scatter correction.
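At its simplest, the per-voxel three-material decomposition described above can be posed as a small linear system: the attenuation measured at the two tube energies is modelled as a mass-fraction-weighted mix of the base materials, with the fractions summing to one. The sketch below shows only this idealized step; the base-material values are illustrative placeholders, and the iterative beam-hardening handling that the paper actually contributes is not included.

```python
import numpy as np

def decompose_voxel(mu_low, mu_high, base_mu):
    """Solve for (water, protein, adipose) fractions of one voxel.

    mu_low, mu_high : measured linear attenuation at the low/high tube energies
    base_mu         : 3x2 array, attenuation of each base material at the two energies
                      (illustrative placeholder values, not taken from the paper)
    Model: measured attenuation is a fraction-weighted mix, and the fractions sum to 1.
    """
    A = np.array([
        [base_mu[0, 0], base_mu[1, 0], base_mu[2, 0]],   # low-energy equation
        [base_mu[0, 1], base_mu[1, 1], base_mu[2, 1]],   # high-energy equation
        [1.0,           1.0,           1.0          ],   # fractions sum to one
    ])
    b = np.array([mu_low, mu_high, 1.0])
    return np.linalg.solve(A, b)   # -> fractions (water, protein, adipose)
```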
@inproceedings{diva2:506005,
author = {Magnusson, Maria and Malusek, Alexandr and Muhammad, Arif and Alm Carlsson, Gudrun},
title = {{Iterative Reconstruction for Quantitative Tissue Decomposition in Dual-Energy CT}},
booktitle = {Proceedings of the 17th Scandinavian Conference, SCIA 2011, Ystad, Sweden, May 2011.},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6688},
pages = {479--488},
publisher = {Springer Berlin/Heidelberg},
}
@inproceedings{diva2:506294,
author = {Magnusson, Maria and Malusek, Alexandr and Muhammad, Arif and Alm Carlsson, Gudrun},
title = {{Determination of Quantitative Tissue Composition by Iterative Reconstruction on 3D DECT Volumes}},
booktitle = {Proc 11:th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, Potsdam, Germany},
year = {2011},
}
@inproceedings{diva2:475908,
author = {Magnusson, Maria and Dahlqvist Leinhard, Olof and Lundberg, Peter},
title = {{A 3D-Plus-Time Radial-Cartesian Hybrid Sampling of K-Space With High Temporal Resolution and Maintained Image Quality for MRI and FMRI}},
booktitle = {ISMRM, Montreal 2011},
year = {2011},
}
@inproceedings{diva2:475370,
author = {Ahlman, Gustav and Magnusson, Maria and Dahlqvist Leinhard, Olof and Lundberg, Peter},
title = {{Increased temporal resolution in radial-Cartesian sampling of k-space by implementation of parallel imaging}},
booktitle = {ESMRMB 2011, 28th Annual Scientific Meeting, 6-8 October 2011, Leipzig, Germany},
year = {2011},
publisher = {Springer},
}
@inproceedings{diva2:475363,
author = {Karlsson, Anette and Magnusson, Maria and Dahlqvist Leinhard, Olof and Lundberg, Peter},
title = {{Successful Motion Correction in Reconstruction of Radial MRI}},
booktitle = {ESMRMB, Leipzig 2011},
year = {2011},
}
We use the theory of group representations to construct very fast image descriptors that split the vector space of local RGB distributions into small group-invariant subspaces. These descriptors are group-theoretical generalizations of the Fourier transform and can be computed with algorithms similar to the FFT. Because of their computational efficiency they are especially suitable for retrieval, recognition and classification in very large image datasets. We also show that the statistical properties of these descriptors are governed by the principles of Extreme Value Theory (EVT). This enables us to work directly with parametric probability distribution models, which offer much lower dimensionality and higher resolution and flexibility. We explore the connection to EVT and analyse the characteristics of these descriptors from a probabilistic viewpoint with the help of large image databases.
@inproceedings{diva2:463658,
author = {Zografos, Vasileios and Lenz, Reiner},
title = {{Spatio-chromatic image content descriptors and their analysis using Extreme Value Theory}},
booktitle = {Image analysis},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6688},
pages = {579--591},
publisher = {Springer Berlin/Heidelberg},
}
We introduce a simple and efficient procedure for the segmentation of rigidly moving objects imaged under an affine camera model. For this purpose we revisit the theory of "linear combination of views" (LCV), proposed by Ullman and Basri [20], which states that the set of 2D views of an object undergoing 3D rigid transformations is embedded in a low-dimensional linear subspace that is spanned by a small number of basis views. Our work shows that one may use this theory for motion segmentation and cluster the trajectories of 3D objects using only two 2D basis views. We therefore propose a practical motion segmentation method, built around LCV, that is very simple to implement and use, and in addition is very fast, making it well suited for real-time SfM and tracking applications. We have experimented on real image sequences, where we show good segmentation results, comparable to the state of the art in the literature. If computational complexity is also considered, our proposed method is one of the best performers in combined speed and accuracy.
@inproceedings{diva2:463659,
author = {Zografos, Vasileios and Nordberg, Klas},
title = {{Fast and accurate motion segmentation using linear combination of views}},
booktitle = {BMVC 2011},
year = {2011},
pages = {12.1--12.11},
}
Segmentation is an important preprocessing step in many applications. Purely colour-based segmentation is often problematic. For this reason, we here investigate fusion of depth and colour information, to facilitate robust segmentation of single images. We evaluate segmentation results on data collected using the Microsoft Kinect peripheral for Xbox 360, using superparamagnetic clustering. We also propose a method for aligning and encoding colour and depth information from the Kinect device. As we show in the paper, the fusion of depth and colour information produces more semantically relevant segments for scene analysis than either depth or colour separately.
@inproceedings{diva2:441482,
author = {Wallenberg, Marcus and Felsberg, Michael and Forss\'{e}n, Per-Erik and Dellen, Babette},
title = {{Leaf Segmentation using the Kinect}},
booktitle = {Proceedings of SSBA 2011 Symposium on Image Analysis},
year = {2011},
}
Segmentation is an important preprocessing step in many applications. Compared to colour segmentation, fusion of colour and depth greatly improves the segmentation result. Such a fusion is easy to do by stacking measurements in different value dimensions, but there are better ways. In this paper we perform fusion using the channel representation, and demonstrate how a state-of-the-art segmentation algorithm can be modified to use channel values as inputs. We evaluate segmentation results on data collected using the Microsoft Kinect peripheral for Xbox 360, using the superparamagnetic clustering algorithm. Our experiments show that depth gradients are more useful than depth values for segmentation, and that channel coding both colour and depth gradients makes tuned parameter settings generalise better to novel images.
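For readers unfamiliar with channel representations, the sketch below encodes scalar values (for example a colour component or a depth gradient) with cos-squared kernels, one common kernel choice in the channel-coding literature. The kernel, its width and the channel layout are illustrative assumptions and may differ from the exact encoding used in the paper.

```python
import numpy as np

def channel_encode(x, centers):
    """Encode scalar values x into channel vectors using cos^2 kernels with
    a support of three channel spacings (a common, but assumed, choice)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    spacing = centers[1] - centers[0]
    width = 3.0 * spacing
    d = np.abs(x - centers[None, :])
    resp = np.cos(np.pi * d / width) ** 2
    resp[d >= width / 2.0] = 0.0          # each value activates about 3 channels
    return resp

centers = np.linspace(0.0, 1.0, 8)        # channels spanning a unit feature range
print(np.round(channel_encode([0.37, 0.90], centers), 3))
```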
@inproceedings{diva2:441260,
author = {Wallenberg, Marcus and Felsberg, Michael and Forss\'{e}n, Per-Erik and Dellen, Babette},
title = {{Channel Coding for Joint Colour and Depth Segmentation}},
booktitle = {Proceedings of Pattern Recognition 33rd DAGM Symposium, Frankfurt/Main, Germany, August 31 - September 2},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6835},
pages = {306--315},
publisher = {Springer},
}
This paper presents a method for rectifying video sequences from rolling shutter (RS) cameras. In contrast to previous RS rectification attempts we model distortions as being caused by the 3D motion of the camera. The camera motion is parametrised as a continuous curve, with knots at the last row of each frame. Curve parameters are solved for using non-linear least squares over inter-frame correspondences obtained from a KLT tracker. We have generated synthetic RS sequences with associated ground-truth to allow controlled evaluation. Using these sequences, we demonstrate that our algorithm improves over two previously published methods. The RS dataset is available on the web to allow comparison with other methods.
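The rectification idea can be sketched as follows: interpolate a camera rotation for the acquisition time of each image row and warp points to a common reference orientation. The sketch below uses SciPy's Slerp as a simple stand-in for the paper's spline-parametrised rotation curve, and the intrinsics and knot rotations are made-up placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# One knot rotation per frame (hypothetical values), placed in time at the
# last row of each frame; Slerp interpolates rotations for intermediate rows.
frame_times = np.array([0.0, 1.0, 2.0])
knots = Rotation.from_rotvec([[0.000, 0.000, 0.000],
                              [0.010, 0.002, 0.000],
                              [0.020, 0.001, -0.003]])
curve = Slerp(frame_times, knots)

def rectify_point(x, K, t_row, t_ref=0.0):
    """Map a point observed at row time t_row to the reference orientation:
    x' ~ K R(t_ref) R(t_row)^T K^-1 x (pure-rotation rolling-shutter model)."""
    H = K @ curve(t_ref).as_matrix() @ curve(t_row).as_matrix().T @ np.linalg.inv(K)
    xh = H @ np.array([x[0], x[1], 1.0])
    return xh[:2] / xh[2]

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])  # assumed intrinsics
print(rectify_point((100.0, 50.0), K, t_row=0.3))
```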
@inproceedings{diva2:441252,
author = {Ringaby, Erik and Forss\'{e}n, Per-Erik},
title = {{Rectifying rolling shutter video from hand-held devices}},
booktitle = {Proceedings of SSBA'11 Symposium on Image Analysis},
year = {2011},
}
A new approach to track bicycles from imagery sensor data is proposed. It is based on detecting ellipsoids in the images and treating these pair-wise using a dynamic bicycle model. One important application area is in automotive collision avoidance systems, where no dedicated systems for bicyclists yet exist and where very few theoretical studies have been published.
Possible conflicts can be predicted from the position and velocity state in the model, but also from the steering wheel articulation and roll angle that indicate yaw changes before the velocity vector changes. An algorithm is proposed which consists of an ellipsoid detection and estimation algorithm and a particle filter.
A simulation study of three critical single target scenarios is presented, and the algorithm is shown to produce excellent state estimates. An experiment using a stationary camera and the particle filter for state estimation is performed and has shown encouraging results.
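The state estimation step can be illustrated with a generic bootstrap particle filter. The sketch below uses a plain constant-velocity state and position measurements (such as those produced by the ellipse detector), whereas the bicycle model in the paper additionally carries steering and roll angles; all noise levels here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter(measurements, n=1000, dt=0.1, q=0.5, r=1.0):
    """Bootstrap particle filter over a constant-velocity state [x, y, vx, vy];
    the loop shows only the generic predict / weight / resample steps."""
    parts = rng.normal(0.0, 5.0, size=(n, 4))
    estimates = []
    for z in measurements:
        parts[:, :2] += dt * parts[:, 2:]                 # predict: constant-velocity motion
        parts += rng.normal(0.0, q, size=parts.shape)     # add process noise
        d2 = np.sum((parts[:, :2] - z) ** 2, axis=1)      # weight by measurement likelihood
        w = np.exp(-0.5 * d2 / r ** 2)
        w /= w.sum()
        estimates.append(w @ parts)                       # weighted-mean state estimate
        parts = parts[rng.choice(n, size=n, p=w)]         # resample
    return np.array(estimates)

zs = np.cumsum(rng.normal([0.5, 0.2], 0.3, size=(50, 2)), axis=0)
print(particle_filter(zs)[-1])
```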
@inproceedings{diva2:430910,
author = {Ardeshiri, Tohid and Larsson, Fredrik and Gustafsson, Fredrik and Schön, Thomas B. and Felsberg, Michael},
title = {{Bicycle Tracking Using Ellipse Extraction}},
booktitle = {Proceedings of the 14th International Conference on Information Fusion, 2011},
year = {2011},
pages = {1--8},
publisher = {IEEE},
}
Convolution kernels are a commonly used tool in computer vision. These kernels are often specified by an ideal frequency response, and the actual filter coefficients are obtained by minimizing some weighted distance with respect to the ideal filter. State-of-the-art approaches usually replace the continuous frequency response by a discrete Fourier spectrum with a number of samples that is large compared to the kernel size, depending on the smoothness of the ideal filter and the weight function. The number of samples in the Fourier domain grows exponentially with the dimensionality and becomes a bottleneck concerning memory requirements.
In this paper we propose a method that avoids the discretization of the frequency space and makes filter optimization feasible in higher dimensions than the standard approach. The result no longer depends on the choice of the sampling grid and remains exact even if the weighting function is singular at the origin. The resulting improper integrals are efficiently computed using Gauss-Jacobi quadrature.
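To make the quadrature step concrete, the sketch below uses SciPy's Gauss-Jacobi routine, which integrates against weights of the form (1-x)^alpha (1+x)^beta even when they are singular at an endpoint. The weight shown is purely illustrative and not the paper's filter-design weighting.

```python
import numpy as np
from scipy.special import roots_jacobi

# Gauss-Jacobi nodes/weights for f(x) * (1 - x)^(-1/2) on [-1, 1]; the
# singular factor is built into the rule, so f is only evaluated at
# interior nodes and the singularity never has to be sampled.
alpha, beta = -0.5, 0.0
nodes, weights = roots_jacobi(20, alpha, beta)

# Sanity check: with f == 1 the rule returns the integral of the weight
# itself, which is 2*sqrt(2) for (1 - x)^(-1/2) over [-1, 1].
print(np.sum(weights), 2.0 * np.sqrt(2.0))

# A smooth integrand is handled with the same nodes and weights.
print(np.sum(weights * np.cos(nodes)))
```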
@inproceedings{diva2:429706,
author = {Krebs, Andreas and Wiklund, Johan and Felsberg, Michael},
title = {{Optimization of Quadrature Filters Based on the Numerical Integration of Improper Integrals}},
booktitle = {Pattern Recognition},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6835},
pages = {91--100},
publisher = {Springer Berlin/Heidelberg},
}
Traffic sign recognition is important for the development of driver assistance systems and fully autonomous vehicles. Even though GPS navigation systems work well most of the time, there will always be situations when they fail. In these cases, robust vision based systems are required. Traffic signs are designed to have distinct colored fields separated by sharp boundaries. We propose to use locally segmented contours combined with an implicit star-shaped object model as prototypes for the different sign classes. The contours are described by Fourier descriptors. Matching of a query image to the sign prototype database is done by exhaustive search. This is done efficiently by using the correlation based matching scheme for Fourier descriptors and a fast cascaded matching scheme for enforcing the spatial requirements. We demonstrate state-of-the-art performance on a publicly available database.
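A minimal sketch of contour Fourier descriptors is given below: the contour is read as a complex signal, transformed with the FFT, and normalised for translation and scale. The normalisation details follow common practice and may differ from the exact matching scheme used in the paper.

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=16):
    """Fourier descriptors of a closed 2-D contour (N x 2 array of points)."""
    z = contour[:, 0] + 1j * contour[:, 1]     # contour as a complex signal
    Z = np.fft.fft(z)
    Z[0] = 0.0                                 # drop DC term: translation invariance
    Z = Z / (np.abs(Z[1]) + 1e-12)             # normalise by first harmonic: scale invariance
    return np.concatenate([Z[1:n_coeffs // 2 + 1], Z[-n_coeffs // 2:]])

# Example: a sampled circle concentrates its energy in the first harmonic.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
print(np.round(np.abs(fourier_descriptors(circle)), 3))
```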
@inproceedings{diva2:428290,
author = {Larsson, Fredrik and Felsberg, Michael},
title = {{Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition}},
booktitle = {Image Analysis},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6688},
pages = {238--249},
publisher = {Springer Berlin/Heidelberg},
}
Techniques from the theory of partial differential equations are often used to design filter methods that are locally adapted to the image structure. These techniques are usually applied to gray-value images. The extension to color images is non-trivial, and the choice of an appropriate color space is crucial. The RGB color space is often used, although it is known that the space of human color perception is best described in terms of non-Euclidean geometry, which is fundamentally different from the structure of the RGB space. Instead of the standard RGB space, we use a simple color transformation based on the theory of finite groups. It is shown that this transformation reduces the color artifacts originating from the diffusion processes on RGB images. The developed algorithm is evaluated on a set of real-world images, and it is shown that our approach exhibits fewer color artifacts compared to state-of-the-art techniques. In addition, our approach preserves image details over a larger number of iterations.
@inproceedings{diva2:424137,
author = {Åström, Freddie and Felsberg, Michael and Lenz, Reiner},
title = {{Color Persistent Anisotropic Diffusion of Images}},
booktitle = {Image Analysis},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6688},
pages = {262--272},
publisher = {Springer},
address = {Heidelberg},
}
This work employs data mining algorithms to discover visual entities that are strongly associated with autonomously discovered modes of action in an embodied agent. Mappings are learnt from these perceptual entities onto the agent's action space. In general, low dimensional action spaces are better suited to unsupervised learning than high dimensional percept spaces, allowing structure to be discovered in the action space and used to organise the perceptual space. Local feature configurations that are strongly associated with a particular ‘type’ of action (and not all other action types) are considered likely to be relevant in eliciting that action type. By learning mappings from these relevant features onto the action space, the system is able to respond in real time to novel visual stimuli. The proposed approach is demonstrated on an autonomous navigation task, and the system is shown to identify the visual entities relevant to the task and to generate appropriate responses.
@inproceedings{diva2:385379,
author = {Ellis, Liam and Felsberg, Michael and Bowden, Richard},
title = {{Affordance mining: Forming perception through action}},
booktitle = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
year = {2011},
series = {Lecture Notes in Computer Science},
volume = {6495},
pages = {525--538},
publisher = {Springer},
}
We present in this paper a research platform for development and evaluation of embodied visual object recognition strategies. The platform uses a stereoscopic peripheral-foveal camera system and a fast pan-tilt unit to perform saliency-based visual search. This is combined with a classification framework based on the bag-of-features paradigm with the aim of targeting, classifying and recognising objects. Interaction with the system is done via typed commands and speech synthesis. We also report the current classification performance of the system.
@inproceedings{diva2:441485,
author = {Wallenberg, Marcus and Forss\'{e}n, Per-Erik},
title = {{A Research Platform for Embodied Visual Object Recognition}},
booktitle = {Proceedings of SSBA 2010 Symposium on Image Analysis},
year = {2010},
series = {Centre for Image Analysis Report Series},
volume = {34},
pages = {137--140},
}
We study the problem of registering a sequence of scan lines (a strip) from an airborne push-broom imager to another sequence partly covering the same area. Such a registration has to compensate for deformations caused by attitude and speed changes in the aircraft. The registration is challenging, as both strips contain such deformations. Our algorithm estimates the 3D rotation of the camera for each scan line, by parametrising it as a linear spline with a number of knots evenly distributed in one of the strips. The rotations are estimated from correspondences between strips of the same area. Once the rotations are known, they can be compensated for, and each line of pixels can be transformed such that the ground traces of the two strips are registered with respect to each other.
@inproceedings{diva2:441244,
author = {Ringaby, Erik and Ahlberg, Jörgen and Forss\'{e}n, Per-Erik and Wadströmer, Niclas},
title = {{Co-alignment of Aerial Push-broom Strips using Trajectory Smoothness Constraints}},
booktitle = {SSBA10, Symposium on Image Analysis 11-12 March, Uppsala},
year = {2010},
pages = {63--66},
publisher = {Swedish Society for automated image analysis},
}
This paper presents a method for rectifying video sequences from rolling shutter (RS) cameras. In contrast to previous RS rectification attempts we model distortions as being caused by the 3D motion of the camera. The camera motion is parametrised as a continuous curve, with knots at the last row of each frame. Curve parameters are solved for using non-linear least squares over inter-frame correspondences obtained from a KLT tracker. We have generated synthetic RS sequences with associated ground-truth to allow controlled evaluation. Using these sequences, we demonstrate that our algorithm improves over two previously published methods. The RS dataset is available on the web to allow comparison with other methods.
@inproceedings{diva2:440573,
author = {Forss\'{e}n, Per-Erik and Ringaby, Erik},
title = {{Rectifying rolling shutter video from hand-held devices}},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010},
year = {2010},
pages = {507--514},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
}
We have performed a field trial with an airborne push-broom hyperspectral sensor, making several flights over the same area and with known changes (e.g., moved vehicles) between the flights. Each flight results in a sequence of scan lines forming an image strip, and in order to detect changes between two flights, the two resulting image strips must be geometrically aligned and radiometrically corrected. The focus of this paper is the geometrical alignment, and we propose an image- and gyro-based method for geometric co-alignment (registration) of two image strips. The method is particularly useful when the sensor is not stabilized, thus reducing the need for expensive mechanical stabilization. The method works in several steps, including gyro-based rectification, global alignment using SIFT matching, and a local alignment using KLT tracking. Experimental results are shown but not quantified, as ground truth is, by the nature of the trial, lacking.
@inproceedings{diva2:440482,
author = {Ringaby, Erik and Ahlberg, Jörgen and Wadströmer, Niclas and Forss\'{e}n, Per-Erik},
title = {{Co-aligning Aerial Hyperspectral Push-broom Strips for Change Detection}},
booktitle = {Proc. SPIE 7835, Electro-Optical Remote Sensing, Photonic Technologies, and Applications IV},
year = {2010},
series = {Proceedings Spie},
volume = {7835},
pages = {Art.nr. 7835B-36--},
publisher = {SPIE - International Society for Optical Engineering},
}
We have suggested a novel method, PRESTO-CAN, comprising radial sampling, filtering and reconstruction of k-space data for 3D-plus-time resolved MRI. The angular increment of the profiles was based on the golden ratio, but the number of angular positions N was locked to be a prime number, which guaranteed fixed angular positions. The time resolution increased dramatically when the profiles were partly removed from the k-space using the hourglass filter. We aim to utilize the MRI data for fMRI, where the echo times are long, TE ≈ 37-40 ms. This will result in field inhomogeneities and phase variations in the reconstructed images. Therefore, a new calibration and correction procedure was developed. We show that we are able to reconstruct images of the human brain with an image quality in line with what can be obtained by conventional Cartesian sampling. The pulse sequence for PRESTO-CAN was implemented by modifying an existing PRESTO sequence for Cartesian sampling. The effort involved was relatively small, and a great advantage is that we are able to use standard procedures for speeding up data acquisition, i.e. parallel imaging with SENSE.
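The golden-ratio profile ordering can be sketched as below. Snapping the golden-angle increments to N fixed angular positions, with N prime, is one simple way to realise the fixed-angle property described above; the paper's exact scheme may differ, and all parameter values are illustrative.

```python
import numpy as np

# Golden-angle increment for radial k-space profiles (in the spirit of
# Winkelmann et al.), about 111.246 degrees.
golden_angle = 180.0 / ((1.0 + np.sqrt(5.0)) / 2.0)

def profile_angles(n_profiles, n_positions=101):       # 101 is prime
    """Acquisition order of profile angles: successive profiles advance by the
    golden angle and are then snapped to one of N fixed angular positions."""
    raw = (np.arange(n_profiles) * golden_angle) % 180.0
    step = 180.0 / n_positions
    return (np.round(raw / step) * step) % 180.0

print(np.round(profile_angles(8), 2))
```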
@inproceedings{diva2:386056,
author = {Magnusson, Maria and Dahlqvist Leinhard, Olof and Brynolfsson, Patrik and Thyr, Per and Lundberg, Peter},
title = {{3D Magnetic Resonance Imaging of the Human Brain - Novel Radial Sampling, Filtering and Reconstruction}},
booktitle = {Proc of the 12th IASTED International Conference on Signal and Image Processing (SIP 2010), August 23 - 25, 2010, Lahaina, Maui, USA},
year = {2010},
series = {ACTA Press},
pages = {Track: 710-042--(8 pages)},
publisher = {ACTA Press},
address = {Calgary, AB, Canada},
}
We present a method for segmenting an arbitrary number of moving objects in image sequences using the geometry of 6 points in 2D to infer motion consistency. The method has been evaluated on the Hopkins155 database and surpasses current state-of-the-art methods such as SSC, both in terms of overall performance on two and three motions and in terms of maximum errors. The method works by finding initial clusters in the spatial domain, and then classifying each remaining point as belonging to the cluster that minimizes a motion consistency score. In contrast to most other motion segmentation methods that are based on an affine camera model, the proposed method is fully projective.
@inproceedings{diva2:376722,
author = {Zografos, Vasileios and Nordberg, Klas and Ellis, Liam},
title = {{Sparse motion segmentation using multiple six-point consistencies.}},
booktitle = {The 2nd International Workshop on Video Event Categorization, Tagging and Retrieval (VECTaR 2010)},
year = {2010},
series = {Lecture Notes in Computer Science},
volume = {6468},
pages = {338--348},
}
We propose a method for segmenting an arbitrary number of moving objects using the geometry of 6 points in 2D images to infer motion consistency. This geometry allows us to determine whether or not observations of 6 points over several frames are consistent with a rigid 3D motion. The matching between observations of the 6 points and an estimated model of their configuration in 3D space is quantified in terms of a geometric error derived from distances between the points and 6 corresponding lines in the image. This leads to a simple motion inconsistency score, based on the geometric errors of the 6 points, that in the ideal case should be zero when the motion of the points can be explained by a rigid 3D motion. Initial point clusters are determined in the spatial domain and merged in the motion trajectory domain based on this score. Each point is then assigned to the cluster which gives the lowest score. Our algorithm has been tested with real image sequences from the Hopkins155 database with very good results, competing with the state-of-the-art methods, particularly for degenerate motion sequences. In contrast to the motion segmentation methods based on multi-body factorization, which assume an affine camera model, the proposed method allows the mapping from 3D space to the 2D image to be fully projective.
@inproceedings{diva2:376712,
author = {Nordberg, Klas and Zografos, Vasileios},
title = {{Multibody motion segmentation using the geometry of 6 points in 2D images.}},
booktitle = {International Conference on Pattern Recognition},
year = {2010},
series = {International Conference on Pattern Recognition},
pages = {1783--1787},
publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
}
@inproceedings{diva2:358056,
author = {Felsberg, Michael and Larsson, Fredrik},
title = {{Learning object tracking in image sequences}},
booktitle = {International Conference on Cognitive Systems},
year = {2010},
}
A common computer vision task is navigation and mapping. Many indoor navigation tasks require depth knowledge of flat, unstructured surfaces (walls, floor, ceiling). With passive illumination only, this is an ill-posed problem. Inspired by small children using a torchlight, we use a spotlight for active illumination. Using our torchlight approach, depth and orientation estimation of unstructured, flat surfaces boils down to estimation of ellipse parameters. The extraction of ellipses is very robust and requires little computational effort.
@inproceedings{diva2:358053,
author = {Felsberg, Michael and Larsson, Fredrik and Wang, Han and Ynnerman, Anders and Schön, Thomas},
title = {{Torchlight Navigation}},
booktitle = {Proceedings of the 20th International Conference on Pattern Recognition},
year = {2010},
series = {International Conference on Pattern Recognition},
pages = {302--306},
}
Feature hierarchies are essential to many visual object recognition systems and are well motivated by observations in biological systems. The present paper proposes an algorithm to incrementally compute feature hierarchies. The features are represented as estimated densities, using a variant of local soft histograms. The kernel functions used for this estimation, in conjunction with their unitary extension, establish a tight frame, and results from framelet theory apply. Traversing the feature hierarchy requires resampling of the spatial and the feature bins. For the resampling, we derive a multi-resolution scheme for quadratic spline kernels and an optimization algorithm for the upsampling. We complement the theoretical results with illustrative experiments and considerations of convergence rate and computational efficiency.
@inproceedings{diva2:358047,
author = {Felsberg, Michael},
title = {{Incremental computation of feature hierarchies}},
booktitle = {Pattern Recognition},
year = {2010},
series = {Lecture Notes in Computer Science},
volume = {6376},
pages = {523--532},
publisher = {Springer Berlin/Heidelberg},
}
A common computer vision task is navigation and mapping. Many indoor navigation tasks require depth knowledge of flat, unstructured surfaces (walls, floor, ceiling). With passive illumination only, this is an ill-posed problem. Inspired by small children using a torchlight, we use a spotlight for active illumination. Using our torchlight approach, depth and orientation estimation of unstructured, flat surfaces boils down to estimation of ellipse parameters. The extraction of ellipses is very robust and requires little computational effort.
@inproceedings{diva2:342959,
author = {Felsberg, Michael and Larsson, Fredrik and Han, Wang and Ynnerman, Anders and Schön, Thomas},
title = {{Torch Guided Navigation}},
booktitle = {Proceedings of the 2010 SSBA Symposium},
year = {2010},
pages = {8--9},
}
@inproceedings{diva2:342954,
author = {Felsberg, Michael},
title = {{Efficient Computation of Feature Hierarchies using Framelets}},
booktitle = {Inverse Problems and Applications},
year = {2010},
}
@inproceedings{diva2:342953,
author = {Wiklund, Johan and Nordberg, Klas and Felsberg, Michael},
title = {{Software architecture and middleware for artificial cognitive systems}},
booktitle = {International Conference on Cognitive Systems},
year = {2010},
}
@inproceedings{diva2:342951,
author = {Hedborg, Johan and Felsberg, Michael},
title = {{Fast and Robust Relative Pose Estimation for Forward and Sideways Motions}},
booktitle = {SSBA},
year = {2010},
}
In this position paper, we seek to extend the layered perception-action paradigm for on-line learning such that it includes an explicit symbolic processing capability. By incorporating symbolic processing at the apex of the perception-action hierarchy in this way, we ensure that abstract symbol manipulation is fully grounded, without the necessity of specifying an explicit representational framework. In order to carry out this novel interfacing between symbolic and sub-symbolic processing, it is necessary to embed fuzzy first-order logic theorem proving within a variational framework. The online learning resulting from the corresponding Euler-Lagrange equations establishes an extended adaptability compared to the standard subsumption architecture. We discuss an application of this approach within the field of advanced driver assistance systems, demonstrating that a closed-form solution to the Euler-Lagrange optimization problem is obtainable for simple cases.
@inproceedings{diva2:342948,
author = {Felsberg, Michael and Shaukat, Affan and Windridge, David},
title = {{Online Learning in Perception-Action Systems}},
booktitle = {ECCV 2010 Workshop on Vision for Cognitive Tasks},
year = {2010},
}
Commonly, surveillance operators are today monitoring a large number of CCTV screens, trying to solve the complex cognitive tasks of analyzing crowd behavior and detecting threats and other abnormal behavior. Information overload is a rule rather than an exception. Moreover, CCTV footage lacks important indicators revealing certain threats, and can also in other respects be complemented by data from other sensors. This article presents an approach to automatically interpret sensor data and estimate behaviors of groups of people in order to provide the operator with relevant warnings. We use data from distributed heterogeneous sensors (visual cameras and a thermal infrared camera), and process the sensor data using detection algorithms. The extracted features are fed into a hidden Markov model in order to model normal behavior and detect deviations. We also discuss the use of radars for weapon detection.
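The deviation-detection step can be illustrated with the standard scaled forward algorithm for a hidden Markov model: feature sequences whose likelihood under the model of normal behaviour is low are flagged as deviations. The small discrete HMM below has made-up parameters and is not the trained model from the article.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm.
    pi: (S,) initial probabilities, A: (S, S) transitions,
    B: (S, K) emission probabilities over K discrete feature symbols."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

pi = np.array([0.7, 0.3])                     # illustrative parameters only
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
print(forward_loglik([0, 0, 1, 2, 2], pi, A, B))   # low values flag deviations
```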
@inproceedings{diva2:846261,
author = {Andersson, Maria and Rydell, Joakim and Ahlberg, Jörgen},
title = {{Estimation of crowd behaviour using sensor networks and sensor fusion}},
booktitle = {12th International Conference on Information Fusion (FUSION)},
year = {2009},
pages = {396--403},
publisher = {IEEE conference proceedings},
}
A single-camera gaze tracker has been created, based on previous implementations by Shih and Liu [5] and Hennessey et al. [3]. The method used is based on controlled infrared illumination. The implemented system has been evaluated on both synthetic and real image data and found to be capable of estimating gaze point with an accuracy of approximately 1° visual angle.
@inproceedings{diva2:441486,
author = {Wallenberg, Marcus},
title = {{A Simple Single-Camera Gaze Tracker using Infrared Illumination}},
booktitle = {Proceedings of SSBA 2009 Symposium on Image Analysis},
year = {2009},
pages = {53--56},
}
We compare the performance of two real-time multimedia communication systems for quality versus end-to-end delay. We develop an analytical framework for comparison when the systems use a deterministic time-varying channel. Moreover, we assess their performance for the Gilbert-Elliott channel model which alternates between a good and a bad state with time durations that are exponentially distributed. The goal of the paper is to select the best system with low average distortion while obeying a real-time constraint.
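A minimal simulation of the Gilbert-Elliott channel described above is sketched below, with exponentially distributed dwell times in the good and bad states. The rates and mean dwell times are illustrative placeholders, and the paper's analytical framework is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gilbert_elliott(duration, rate_good, rate_bad, mean_good=1.0, mean_bad=0.2):
    """Two-state channel with exponentially distributed dwell times; returns
    the switching times and the channel rate in effect after each switch."""
    t, good = 0.0, True
    times, rates = [0.0], [rate_good]
    while t < duration:
        t += rng.exponential(mean_good if good else mean_bad)
        good = not good
        times.append(t)
        rates.append(rate_good if good else rate_bad)
    return np.array(times), np.array(rates)

times, rates = gilbert_elliott(10.0, rate_good=2.0e6, rate_bad=2.0e5)
avg_rate = np.sum(np.diff(times) * rates[:-1]) / times[-1]
print(f"time-averaged channel rate: {avg_rate:.3g} bit/s")
```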
@inproceedings{diva2:415217,
author = {Muhammad, Ajmal and Johansson, Peter and Forchheimer, Robert},
title = {{Effect of Buffer Placement on Performance When Communicating Over a Rate-Variable Channel}},
booktitle = {ICSNC 2009},
year = {2009},
publisher = {IEEE, 345 E 47TH ST, NEW YORK, NY 10017 USA},
}
Tensor valued data are frequently used in medical imaging. For a 3-dimensional second order tensor, such data imply at least six degrees of freedom for each voxel. The operator's ability to perceive this information is of utmost importance and in many cases a limiting factor for the interpretation of the data. In this paper we propose a decomposition of such tensor fields using the T-flash tensor glyphs that intuitively conveys important tensor features to a human observer. A Matlab implementation for visualization of single tensors is described in detail, and a VTK/ITK implementation for visualization of tensor fields has been developed as a Medical Studio component.
@inproceedings{diva2:355247,
author = {Wiklund, Johan and Nicolas, Vincent and Alface, Patrice R. and Andersson, Mats and Knutsson, Hans},
title = {{T-flash: Tensor Visualization in Medical Studio}},
booktitle = {Tensors in Image Processing and Computer Vision},
year = {2009},
series = {Advances in Pattern Recognition},
pages = {455--466},
publisher = {Springer London},
}
In this work we examine in detail the use of optimisation algorithms on deformable template matching problems. We start with the examination of simple, direct-search methods and move on to more complicated evolutionary approaches. Our goal is twofold: first, evaluate a number of methods examined under different template matching settings and introduce the use of certain, novel evolutionary optimisation algorithms to computer vision, and second, explore and analyse any additional advantages of using a hybrid approach over existing methods. We show that in computer vision tasks, evolutionary strategies provide very good choices for optimisation. Our experiments have also indicated that we can improve the convergence speed and results of existing algorithms by using a hybrid approach.
@inproceedings{diva2:280074,
author = {Zografos, Vasileios},
title = {{Comparison of Optimisation Algorithms for Deformable Template Matching}},
booktitle = {Advances in Visual Computing},
year = {2009},
series = {Lecture notes in computer science},
volume = {5876},
pages = {1097--1108},
publisher = {Springer},
address = {Berlin},
}
We present a new method for matching a region between an input and a query image, based on the P-channel representation of pixel-based image features such as grayscale and color information, local gradient orientation and local spatial coordinates. We introduce the concept of integral P-channels, which combines the concepts of P-channels and integral images. Using integral images, the P-channel representation of a given region is extracted with a few arithmetic operations. This enables a fast nearest-neighbor search over all possible target regions. We present extensive experimental results and show that our approach compares favorably to existing methods for region matching such as histograms or region covariance.
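The integral-image machinery behind the constant-time region extraction can be sketched as follows. In the paper one such summed-area table would be kept per P-channel plane; the sketch below shows the basic box-sum trick on a single plane.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row and left column, so any
    rectangular sum needs only four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], computed in O(1) from the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(20.0).reshape(4, 5)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 4), img[1:3, 1:4].sum())   # identical results
```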
@inproceedings{diva2:342995,
author = {Pagani, Alain and Stricker, Didier and Felsberg, Michael},
title = {{Integral P-channels for fast and robust region matching}},
booktitle = {2009 16th IEEE International Conference on Image Processing (ICIP)},
year = {2009},
pages = {213--216},
}
Linear scale-space theory is the fundamental building block for many approaches to image processing like pyramids or scale-selection. However, linear smoothing does not preserve image structures very well and thus non-linear techniques are mostly applied for image enhancement. A different perspective is given in the framework of channel-smoothing, where the feature domain is not considered as a linear space, but it is decomposed into local basis functions. One major drawback is the larger memory requirement for this type of representation, which is avoided if the channel representation is subsampled in the spatial domain. This general type of feature representation is called channel-coded feature map (CCFM) in the literature and a special case using linear channels is the SIFT descriptor. For computing CCFMs the spatial resolution and the feature resolution need to be selected.
In this paper, we focus on the spatio-featural scale-space from a scale-selection perspective. We propose a coupled scheme for selecting the spatial and the featural scales. The scheme is based on an analysis of lower bounds for the product of uncertainties, which is summarized in a theorem about a spatio-featural uncertainty relation. As a practical application of the derived theory, we reconstruct images from CCFMs with resolutions according to our theory. The results are very similar to the results of non-linear evolution schemes, but our algorithm has the fundamental advantage of being non-iterative. Any level of smoothing can be achieved with about the same computational effort.
@inproceedings{diva2:342992,
author = {Felsberg, Michael},
title = {{Spatio-featural scale-space}},
booktitle = {Swedish Symposium on Image Analysis - SSBA'2009, 18-20 March, Halmstad, Sweden},
year = {2009},
}
Fourier descriptors (FDs) are a classical but still popular method for contour matching. The key idea is to apply the Fourier transform to a periodic representation of the contour, which results in a shape descriptor in the frequency domain. Fourier descriptors have mostly been used to compare object silhouettes and object contours; we instead use this well-established machinery to describe local regions to be used in an object recognition framework. We extract local regions using the Maximally Stable Extremal Regions (MSER) detector and represent the external contour by FDs. Many approaches to matching FDs are based on the magnitude of each FD component, thus ignoring the information contained in the phase. Keeping the phase information requires us to take into account the global rotation of the contour and shifting of the contour samples. We show that the sum-of-squared differences of FDs can be computed without explicitly de-rotating the contours. We compare our correlation based matching against affine-invariant Fourier descriptors (AFDs) and demonstrate that our correlation based approach outperforms AFDs on real world data.
@inproceedings{diva2:276794,
author = {Larsson, Fredrik and Felsberg, Michael and Forss\'{e}n, Per-Erik},
title = {{Patch Contour Matching by Correlating Fourier Descriptors}},
booktitle = {Digital Image Computing: Techniques and Applications (DICTA)},
year = {2009},
pages = {40--46},
publisher = {IEEE Computer Society},
}
This paper describes a system that efficiently uses the KLT tracker together with a calibrated 5-point solver for structure-from-motion (SfM). Our system uses a GPU to perform tracking, and the CPU for SfM.
In this setup, it is advantageous to run the tracker both forwards and backwards in time, to detect incorrectly tracked points. We introduce a modification to the point selection inside the RANSAC step of the 5-point solver, and demonstrate how this speeds up the algorithm. Our evaluations are done using both real camera sequences and data from a state-of-the-art rendering engine with associated ground-truth.
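The forward-backward consistency idea mentioned above can be sketched with OpenCV's pyramidal Lucas-Kanade tracker: points are tracked forwards and then backwards, and points whose round-trip error exceeds a threshold are rejected. The threshold and the synthetic test images are illustrative choices rather than the paper's setup.

```python
import numpy as np
import cv2

def track_with_fb_check(prev_img, next_img, pts, fb_thresh=1.0):
    """Track points forwards then backwards with pyramidal LK and keep only
    points whose forward-backward error is below fb_thresh pixels."""
    pts = pts.astype(np.float32).reshape(-1, 1, 2)
    fwd, st1, _ = cv2.calcOpticalFlowPyrLK(prev_img, next_img, pts, None)
    back, st2, _ = cv2.calcOpticalFlowPyrLK(next_img, prev_img, fwd, None)
    fb_err = np.linalg.norm(pts - back, axis=2).ravel()
    good = (st1.ravel() == 1) & (st2.ravel() == 1) & (fb_err < fb_thresh)
    return fwd.reshape(-1, 2), good

# Synthetic textured frame pair related by a pure 3-pixel horizontal shift.
ys, xs = np.mgrid[0:120, 0:160].astype(np.float32)
prev = (127 + 60 * np.sin(xs / 7.0) * np.cos(ys / 5.0)).astype(np.uint8)
nxt = np.roll(prev, 3, axis=1)
new_pts, good = track_with_fb_check(prev, nxt, np.array([[80.0, 60.0], [40.0, 30.0]]))
print(new_pts[good])   # x-coordinates should come out roughly 3 pixels larger
```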
@inproceedings{diva2:271773,
author = {Hedborg, Johan and Forss\'{e}n, Per-Erik},
title = {{Fast and Accurate Ego-Motion Estimation}},
booktitle = {Swedish Symposium on Image Analysis - SSBA'2009, March 18-20, Halmstad, Sweden},
year = {2009},
}
Graphics processors have progressed rapidly in recent years, largely because of the demands of computer games for speed and realistic rendering. Because of the graphics processor's special architecture, it is much faster at solving parallel problems than a conventional processor. Due to its increasing generality, it is possible to use it for other tasks than those it was originally designed for.
Even though graphics processors have been programmable for some time, it has been quite difficult to learn how to use them. CUDA (Compute Unified Device Architecture) enables the programmer to use C code, with a few extensions, to program NVIDIA's graphics processors and completely skip the traditional programming models. This paper investigates whether the graphics processor can be used for calculations without knowledge of how the hardware mechanisms work. An image processing algorithm calculating the optical flow has been implemented. The results show that it is rather easy to implement programs using CUDA, but some knowledge of how the graphics processor works is required to achieve high performance.
@inproceedings{diva2:271770,
author = {Ringaby, Erik},
title = {{Optical Flow Computation on CUDA}},
booktitle = {SSBA},
year = {2009},
pages = {81--84},
}
Radial sampling of k-space is known to simultaneously provide both high spatial and high temporal resolution. Recently, an optimal radial profile time order based on the Golden Ratio was presented in [1]. We have adopted and modified the idea, with a focus on higher temporal resolution without sacrificing any image quality.
[1] Winkelmann et al.: An optimal radial profile order based on the golden ratio for time-resolved MRI, IEEE Trans. Med. Im., Vol. 26, No. 1, 2007.
@inproceedings{diva2:271769,
author = {Magnusson, Maria and Dahlqvist Leinhard, Olof and Brynolfsson, Patrik and Lundberg, Peter},
title = {{Improved temporal resolution in radial k-space sampling using an hourglass filter}},
booktitle = {ISMRM 17th Scientific Meeting \& Exhibition},
year = {2009},
address = {Honolulu, Hawaii, USA},
}
Radial sampling of k-space is known to simultaneously provide both high spatial and high temporal resolution. Recently, an optimal radial profile time order based on the Golden Ratio was presented in [1]. We have adopted and modified the idea, with a focus on higher temporal resolution without sacrificing any image quality.
[1] Winkelmann et al.: An optimal radial profile order based on the golden ratio for time-resolved MRI, IEEE Trans. Med. Im., Vol. 26, No. 1, 2007.
@inproceedings{diva2:271766,
author = {Magnusson, Maria and Dahlqvist Leinhard, Olof and Brynolfsson, Patrik and Lundberg, Peter},
title = {{Radial k-space sampling: step response using different filtering techniques}},
booktitle = {ISMRM Workshop on Data sampling and Image Reconstruction},
year = {2009},
address = {The Enchantment Resort, Sedona, Arizona, USA},
}
This paper describes a system for structure-and-motion estimation for real-time navigation and obstacle avoidance. We demonstrate a technique to increase the efficiency of the 5-point solution to the relative pose problem. This is achieved by a novel sampling scheme, where we add a distance constraint on the sampled points inside the RANSAC loop before calculating the 5-point solution. Our setup uses the KLT tracker to establish point correspondences across time in live video. We also demonstrate how an early outlier rejection in the tracker improves performance in scenes with plenty of occlusions. This outlier rejection scheme is well suited to implementation on graphics hardware. We evaluate the proposed algorithms using real camera sequences with fine-tuned bundle adjusted data as ground truth. To strengthen our results we also evaluate using sequences generated by state-of-the-art rendering software. On average we are able to reduce the number of RANSAC iterations by half and thereby double the speed.
@inproceedings{diva2:271764,
author = {Hedborg, Johan and Forss\'{e}n, Per-Erik and Felsberg, Michael},
title = {{Fast and Accurate Structure and Motion Estimation}},
booktitle = {International Symposium on Visual Computing},
year = {2009},
series = {Lecture Notes in Computer Science},
volume = {5875},
pages = {211--222},
publisher = {Springer-Verlag},
address = {Berlin Heidelberg},
}
The paper describes a minimal set of 18 parameters that can represent any trifocal tensor consistent with the internal constraints. 9 parameters describe three orthogonal matrices and 9 parameters describe 10 elements of a sparse tensor T' with 17 elements in well-defined positions equal to zero. Any valid trifocal tensor is then given as some specific T' transformed by the orthogonal matrices in the respective image domain. The paper also describes a simple approach for estimating the three orthogonal matrices in the case of a general 3 x 3 x 3 tensor, i.e., when the internal constraints are not satisfied. This can be used to accomplish a least squares approximation of a general tensor to a tensor that satisfies the internal constraints. This type of constraint enforcement, in turn, can be used to obtain an improved estimate of the trifocal tensor based on the normalized linear algorithm, with the constraint enforcement as a final step. This makes the algorithm more similar to the corresponding algorithm for estimation of the fundamental matrix. An experiment on synthetic data shows that the constraint enforcement of the trifocal tensor produces a significantly better result than without enforcement, expressed by the positions of the epipoles, given that the constraint enforcement is made in normalized image coordinates.
@inproceedings{diva2:271736,
author = {Nordberg, Klas},
title = {{A minimal parameterization of the trifocal tensor}},
booktitle = {IEEE Computer Science Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2009},
pages = {1224--1230},
}
This work presents a novel object tracking approach, where the motion model is learned from sets of frame-wise detections with unknown associations. We employ a higher-order Markov model on position space instead of a first-order Markov model on a high-dimensional state-space of object dynamics. Compared to the latter, our approach allows the use of marginal rather than joint distributions, which results in a significant reduction of computational complexity. Densities are represented using a grid-based approach, where the rectangular windows are replaced with estimated smooth Parzen windows sampled at the grid points. This method performs as accurately as particle filter methods, with the additional advantage that the prediction and update steps can be learned from empirical data. Our method is compared against standard techniques on image sequences obtained from an RC car following scenario. We show that our approach performs best in most of the sequences. Other potential applications are surveillance from cheap or uncalibrated cameras and image sequence analysis.
@inproceedings{diva2:342945,
author = {Felsberg, Michael and Larsson, Fredrik},
title = {{Learning Higher-Order Markov Models for Object Tracking in Image Sequences}},
booktitle = {Proceedings of the 5th International Symposium on Advances in Visual Computing: Part II},
year = {2009},
series = {Lecture Notes in Computer Science},
volume = {5876},
pages = {184--195},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
}
@inproceedings{diva2:271399,
author = {Larsson, Fredrik and Forss\'{e}n, Per-Erik and Felsberg, Michael},
title = {{Using Fourier descriptors for local region matching}},
booktitle = {SSBA},
year = {2009},
}
Recent years have seen advances in the estimation of full 6 degree-of-freedom object pose from a single 2D image. These advances have often been presented as a result of, or together with, a new local image descriptor. This paper examines how the performance for such a system varies with choice of local descriptor. This is done by comparing the performance of a full 6 degree-of-freedom pose estimation system for fourteen types of local descriptors. The evaluation is done on a database with photos of complex objects with simple and complex backgrounds and varying lighting conditions. From the experiments we can conclude that duplet features, that use pairs of interest points, improve pose estimation accuracy, and that affine covariant features do not work well in current pose estimation frameworks. The data sets and their ground truth is available on the web to allow future comparison with novel algorithms.
@inproceedings{diva2:265756,
author = {Viksten, Fredrik and Forss\'{e}n, Per-Erik and Johansson, Björn and Moe, Anders},
title = {{Comparison of Local Image Descriptors for Full 6 Degree-of-Freedom Pose Estimation}},
booktitle = {IEEE ICRA, 2009},
year = {2009},
pages = {2779--2786},
publisher = {IEEE Robotics and Automation Society},
address = {Kobe},
}
Linear scale-space theory is the fundamental building block for many approaches to image processing like pyramids or scale-selection. However, linear smoothing does not preserve image structures very well and thus non-linear techniques are mostly applied for image enhancement. A different perspective is given in the framework of channel-smoothing, where the feature domain is not considered as a linear space, but it is decomposed into local basis functions. One major drawback is the larger memory requirement for this type of representation, which is avoided if the channel representation is subsampled in the spatial domain. This general type of feature representation is called channel-coded feature map (CCFM) in the literature and a special case using linear channels is the SIFT descriptor. For computing CCFMs the spatial resolution and the feature resolution need to be selected. In this paper, we focus on the spatio-featural scale-space from a scale-selection perspective. We propose a coupled scheme for selecting the spatial and the featural scales. The scheme is based on an analysis of lower bounds for the product of uncertainties, which is summarized in a theorem about a spatio-featural uncertainty relation. As a practical application of the derived theory, we reconstruct images from CCFMs with resolutions according to our theory. The results are very similar to the results of non-linear evolution schemes, but our algorithm has the fundamental advantage of being non-iterative. Any level of smoothing can be achieved with about the same computational effort.
@inproceedings{diva2:216715,
author = {Felsberg, Michael},
title = {{Spatio-featural scale-space}},
booktitle = {Scale Space and Variational Methods in Computer Vision},
year = {2009},
series = {Lecture Notes in Computer Science},
volume = {5567},
pages = {808--819},
publisher = {Springer Berlin/Heidelberg},
}
The on-going EU funded project Prometheus (FP7-214901) aims at establishing a general framework which links fundamental sensing tasks to automated cognition processes enabling interpretation and short-term prediction of individual and collective human behaviours in unrestricted environments as well as complex human interactions. To achieve the aforementioned goals, the Prometheus consortium works on the following core scientific and technological objectives:
1. sensor modeling and information fusion from multiple, heterogeneous perceptual modalities;
2. modeling, localization, and tracking of multiple people;
3. modeling, recognition, and short-term prediction of continuous complex human behavior.
@inproceedings{diva2:846265,
author = {Ahlberg, Jörgen and Arsic, Dejan and Ganchev, Todor and Linderhed, Anna and Menezes, Paolo and Ntalampiras, Stavros and Olma, Tadeusz and Potamitis, Ilyas and Ros, Julien},
title = {{Prometheus: Prediction and interpretation of human behaviour based on probabilistic structures and heterogeneous sensors}},
booktitle = {European Conference on Artificial Intelligence (ECAI)},
year = {2008},
publisher = {European Coordinating Committee for Artificial Intelligence (ECCAI)},
}
Good data sets for evaluation of computer vision algorithms are important for the continued progress of the field. There exist good evaluation sets for many applications, but there are others for which good evaluation sets are harder to come by. One such example is feature tracking, where there is an obvious difficulty in the collection of data. Good evaluation data is important both for comparisons of different algorithms, and to detect weaknesses in a specific method. All image data is a result of light interacting with its environment. These interactions are so well modelled in rendering software that sometimes not even the sharpest human eye can tell the difference between reality and simulation. In this paper we thus propose to use a high quality rendering system to create evaluation data for sparse point correspondence trackers.
@inproceedings{diva2:343534,
author = {Hedborg, Johan and Forss\'{e}n, Per-Erik},
title = {{Synthetic Ground Truth for Feature Trackers}},
booktitle = {Swedish Symposium on Image Analysis 2008},
year = {2008},
}
Triangulation of a 3D point from two or more views can be solved in several ways depending on how perturbations in the image coordinates are dealt with. A common approach is optimal triangulation, which minimizes the total L2 reprojection error in the images, corresponding to finding a maximum likelihood estimate of the 3D point assuming independent Gaussian noise in the image spaces. Computational approaches for optimal triangulation have been published for the stereo case and, recently, also for the three-view case. In short, they solve an independent optimization problem for each 3D point, using relatively complex computations such as finding roots of high order polynomials or matrix decompositions. This paper discusses three-view triangulation and reports the following results: (1) the 3D point can be computed as a multi-linear mapping (tensor) applied to the homogeneous image coordinates, (2) the set of triangulation tensors forms a 7-dimensional space determined by the camera matrices, (3) given a set of corresponding 3D/2D calibration points, the 3D residual L1 errors can be optimized over the elements in the 7-dimensional space, (4) using the resulting tensor as initial value, the error can be further reduced by tuning the tensor in a two-step iterative process, (5) the 3D residual L1 error for a set of evaluation points which lie close to the calibration set is comparable to the three-view optimal method. In summary, three-view triangulation can be done by first performing an optimization of the triangulation tensor and, once this is done, triangulation can be made with a 3D residual error at the same level as the optimal method, but at a much lower computational cost. This makes the proposed method attractive for real-time three-view triangulation of large data sets, provided that the necessary calibration process can be performed.
@inproceedings{diva2:271746,
author = {Nordberg, Klas},
title = {{Efficient Three-view Triangulation Based on 3D Optimization}},
booktitle = {Proceedings of the British Machine Vision Conference 2008},
year = {2008},
pages = {19.1--19.10},
publisher = {BMVA Press},
}
Point-of-interest detection is a way of reducing the amount of data that needs to be processed in a certain application and is widely used in 2D image analysis. In 2D image analysis, point-of-interest detection is usually related to extraction of local descriptors for object recognition, classification, registration or pose estimation. In analysis of range data however, some local descriptors have been published in the last decade or so, but most of them do not mention any kind of point-of-interest detection. We here show how to use an extended Harris detector on range data and discuss variants of the Harris measure. All described variants of the Harris detector for 3D should also be usable in medical image analysis, but we focus on the range data case. We do present a performance evaluation of the described variants of the Harris detector on range data.
@inproceedings{diva2:265790,
author = {Viksten, Fredrik and Nordberg, Klas and Kalms, Mikael},
title = {{Point-of-Interest Detection for Range Data}},
booktitle = {International Conference on Pattern Recognition (ICPR)},
year = {2008},
series = {Pattern Recognition},
pages = {1--4},
publisher = {IEEE},
}
@inproceedings{diva2:265771,
author = {Larsson, Fredrik and Jonsson, Erik and Felsberg, Michael},
title = {{Learning Floppy Robot Control}},
booktitle = {SSBA,2008},
year = {2008},
pages = {39--42},
}
This paper studies the sequential object recognition problem faced by a mobile robot searching for specific objects within a cluttered environment. In contrast to current state-of-the-art object recognition solutions which are evaluated on databases of static images, the system described in this paper employs an active strategy based on identifying potential objects using an attention mechanism and planning to obtain images of these objects from numerous viewpoints. We demonstrate the use of a bag-of-features technique for ranking potential objects, and show that this measure outperforms geometric matching for invariance across viewpoints. Our system implements informed visual search by prioritising map locations and re-examining promising locations first. Experimental results demonstrate that our system is a highly competent object recognition system that is capable of locating numerous challenging objects amongst distractors.
@inproceedings{diva2:265763,
author = {Forss\'{e}n, Per-Erik and Meger, David and Lai, Kevin and Helmer, Scott and Little, James J. and Lowe, David G.},
title = {{Informed Visual Search: Combining Attention and Object Recognition}},
booktitle = {Proceedings - IEEE International Conference on Robotics and Automation},
year = {2008},
series = {Robotics and Automation},
pages = {935--942},
publisher = {IEEE Robotics and Automation Society},
address = {Pasadena},
}
This paper presents a method for triangulation of 3D points given their projections in two images. Recent results show that the triangulation mapping can be represented as a linear operator K applied to the outer product of corresponding homogeneous image coordinates, leading to a triangulation of very low computational complexity. K can be determined from the camera matrices, together with a so-called blind plane, but we show here that it can be further refined by a process similar to Gold Standard methods for camera matrix estimation. In particular it is demonstrated that K can be adjusted to minimize the Euclidean L2 residual 3D error, bringing it down to the same level as the optimal triangulation by Hartley and Sturm. The resulting K optimally fits a set of 2D+2D+3D data where the error is measured in the 3D space. Assuming that this calibration set is representative for a particular application, where later only the 2D points are known, this K can be used for triangulation of 3D points in an optimal way, which in addition is very efficient since the optimization need only be made once for the point set. The refinement of K is made by iteratively reducing errors in the 3D and 2D domains, respectively. Experiments on real data suggest that very few iterations are needed to accomplish useful results.
@inproceedings{diva2:265774,
author = {Nordberg, Klas},
title = {{Efficient Triangulation Based on 3D Euclidean Optimization}},
booktitle = {International Conference on Pattern Recognition (ICPR)},
year = {2008},
series = {IEEE Computer Society},
pages = {1--4},
publisher = {IEEE},
}
@inproceedings{diva2:265776,
author = {Nordberg, Klas},
title = {{Learning based on subspace voting}},
booktitle = {Swedish Symposium on Image Analysis (SSBA)},
year = {2008},
}
In pattern recognition, computer vision, and image processing, many approaches are based on second order operators. Well-known examples are second order networks, the 3D structure tensor for motion estimation, and the Harris corner detector. Quadratic operators form a subset of second order operators. It is less well known that every second order operator can be written as a weighted quadratic operator. The contribution of this paper is to propose an algorithm for converting an arbitrary second order operator into a quadratic operator. We apply the method to several examples from image processing and machine learning. The advantages of the alternative implementation by quadratic operators are two-fold: the underlying linear operators allow new insights into the theory of the respective second order operators, and replacing second order networks with sums of squares of linear networks significantly reduces the computational burden when the trained network is in the operation phase.
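The conversion can be made concrete with a small numerical check: a quadratic form x^T A x equals a weighted sum of squared linear filter responses, obtained from the eigendecomposition of the symmetric part of A. The operator below is a random stand-in rather than one of the trained networks or operators discussed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
A = rng.normal(size=(n, n))          # stand-in second order operator
A_sym = 0.5 * (A + A.T)              # the antisymmetric part contributes nothing
lam, V = np.linalg.eigh(A_sym)       # eigenvectors act as linear filters

x = rng.normal(size=n)               # stand-in signal patch
direct = x @ A @ x
as_sum_of_squares = np.sum(lam * (V.T @ x) ** 2)
print(direct, as_sum_of_squares)     # equal up to floating-point error
```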
@inproceedings{diva2:265738,
author = {Felsberg, Michael},
title = {{On Second Order Operators and Quadratic Operators}},
booktitle = {Proceedings - International Conference on Pattern Recognition},
year = {2008},
pages = {1--4},
publisher = {IEEE},
}
@inproceedings{diva2:265737,
author = {Felsberg, Michael and Larsson, Fredrik},
title = {{Learning Bayesian tracking for motion estimation}},
booktitle = {ECCV Workshop: Machine Learning for Vision-based Motion Analysis},
year = {2008},
}
@inproceedings{diva2:265741,
author = {Felsberg, Michael and Granlund, Gösta},
title = {{Fusing Dynamic Percepts and Symbols in Cognitive Systems}},
booktitle = {International Conference on Cognitive Systems},
year = {2008},
}
In this paper we present a novel numerical approximation scheme for anisotropic diffusion which is at the same time a special case of iterated adaptive filtering. By assuming a sufficiently smooth diffusion tensor field, we simplify the divergence term and obtain an evolution equation that is computed from a scalar product of the diffusion tensor and the Hessian. We further propose a set of filters to approximate the Hessian on a minimized spatial support. On standard benchmarks, the resulting method performs on average nearly as well as the best known denoising methods from the literature, although it is significantly faster and easier to implement. In a GPU implementation, video real-time performance is achieved for moderate noise levels.
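Under the smooth-tensor assumption, one evolution step reduces to the inner product of the diffusion tensor with the Hessian of the image. The sketch below implements this with plain central differences and a constant tensor; the paper instead uses optimised filters with minimal spatial support and a spatially adapted tensor, so this is only a schematic illustration.

```python
import numpy as np

def diffusion_step(u, D, dt=0.2):
    """One step of u += dt * <D, Hess(u)> with central differences; boundary
    rows/columns are left untouched for simplicity."""
    uxx = np.zeros_like(u); uyy = np.zeros_like(u); uxy = np.zeros_like(u)
    uxx[:, 1:-1] = u[:, 2:] - 2.0 * u[:, 1:-1] + u[:, :-2]
    uyy[1:-1, :] = u[2:, :] - 2.0 * u[1:-1, :] + u[:-2, :]
    uxy[1:-1, 1:-1] = 0.25 * (u[2:, 2:] - u[2:, :-2] - u[:-2, 2:] + u[:-2, :-2])
    return u + dt * (D[0, 0] * uxx + 2.0 * D[0, 1] * uxy + D[1, 1] * uyy)

rng = np.random.default_rng(0)
u = rng.normal(size=(64, 64))                  # noisy stand-in image
D = np.array([[1.0, 0.3], [0.3, 0.2]])         # illustrative anisotropic tensor
for _ in range(10):
    u = diffusion_step(u, D)
```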
@inproceedings{diva2:265740,
author = {Felsberg, Michael},
title = {{On the Relation Between Anisotropic Diffusion and Iterated Adaptive Filtering}},
booktitle = {Pattern Recognition},
year = {2008},
series = {Lecture Notes in Computer Science},
volume = {5096},
pages = {436--445},
publisher = {Springer Berlin/Heidelberg},
}
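As a rough illustration of the evolution equation mentioned in the abstract above, one explicit update step can be written as a pointwise scalar product between the diffusion tensor field and the image Hessian. The Hessian below is approximated with np.gradient rather than the minimized-support filters proposed in the paper, so this is a sketch of the idea, not the published scheme.

```python
import numpy as np

def diffusion_step(u, D, tau=0.1):
    """One explicit update u <- u + tau * <D, Hess(u)>.

    u : (H, W) image,  D : (H, W, 2, 2) smooth diffusion tensor field.
    """
    uy, ux = np.gradient(u)                # first derivatives (axis 0 = y, axis 1 = x)
    uyy, uyx = np.gradient(uy)             # second derivatives of uy
    uxy, uxx = np.gradient(ux)             # second derivatives of ux
    scalar_product = (D[..., 0, 0] * uxx + D[..., 1, 1] * uyy
                      + 0.5 * (D[..., 0, 1] + D[..., 1, 0]) * (uxy + uyx))
    return u + tau * scalar_product

# Isotropic special case (D = identity everywhere) reduces to linear diffusion.
img = np.random.rand(64, 64)
D_iso = np.broadcast_to(np.eye(2), img.shape + (2, 2)).copy()
smoothed = diffusion_step(img, D_iso)
```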
In this paper we address parallel-beam 2D computed tomography reconstruction. The proposed method belongs to the field of analytic reconstruction methods and is compared to several methods known in the field, among others the two-step Hilbert-transform method. In contrast to the latter, the derivative data is multiplied with an orientation vector and the Hilbert transform is replaced with the Riesz transform. Experimental results show that the new method is superior to established ones concerning aliasing, noise, and DC errors.
@inproceedings{diva2:262426,
author = {Felsberg, Michael},
title = {{A Novel two-step Method for CT Reconstruction}},
booktitle = {Bildverarbeitung für die Medizin},
year = {2008},
series = {Informatik aktuell},
pages = {303--307},
publisher = {Springer},
address = {Heidelberg},
}
A new projection operator is presented and evaluated. This operator has been designed to suppress aliasing artifacts due to (i) false high frequencies contained in the footprint function, and (ii) high frequencies caused by a divergent beam geometry. It is easy to implement and allows for efficient computer implementations. Instead of sampling the footprint as done in most projection operators, the footprint is integrated. This integration suppresses false high frequencies, i.e., frequency components that cause aliasing, and approximately takes into account the finite size of focus and detector. Two-dimensional parallel beam experiments are presented. These experiments confirm that artifacts due to false high frequencies can be suppressed by the proposed technique. In order to investigate the advantages for divergent beam geometries, current experiments must be complemented with cone-beam experiments.
@inproceedings{diva2:273870,
author = {Sunnegårdh, Johan and Danielsson, Per-Erik},
title = {{A new anti-aliased projection operator for iterative CT reconstruction}},
booktitle = {Proceedings of the Ninth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine, Lindau, Germany, July 9-13, 2007},
year = {2007},
}
We present a novel local descriptor for range data that can describe one or more planes or lines in a local region. It is possible to recover the geometry of the described local region and extract the size, position and orientation of each local plane or line-like structure from the descriptor. This gives the descriptor a property that other popular local descriptors for range data, such as spin images or point signatures, do not have. The estimation of the descriptor is dependent on estimation of surface normals but does not depend on the specific normal estimation method used. It is shown that it is possible to extract how many planar surface regions the descriptor represents and that this could be used as a point-of-interest detector.
@inproceedings{diva2:273829,
author = {Viksten, Fredrik and Nordberg, Klas},
title = {{A Geometry-Based Local Descriptor for Range Data}},
booktitle = {Proceedings of the 9th Biennial Conference of the Australian Pattern Recognition Society on Digital Image Computing Techniques and Applications},
year = {2007},
pages = {210--217},
publisher = {ACM},
}
@inproceedings{diva2:265766,
author = {Forss\'{e}n, Per-Erik},
title = {{Learning Saccadic Gaze Control via Motion Prediction}},
booktitle = {IEEE Canadian CRV,2007},
year = {2007},
publisher = {IEEE Computer Society},
address = {Montreal},
}
@inproceedings{diva2:265765,
author = {Helmer, Scott and Meger, David and Forss\'{e}n, Per-Erik and Southey, Tristram and McCann, Sancho and Fazli, Pooyan and Little, James J. and Lowe, David G.},
title = {{The UBC Semantic Robot Vision System}},
booktitle = {AAAI,2007},
year = {2007},
publisher = {AAAI Press},
address = {Vancouver},
}
@inproceedings{diva2:265757,
author = {Forss\'{e}n, Per-Erik and Lowe, David G.},
title = {{Shape Descriptors for Maximally Stable Extremal Regions}},
booktitle = {IEEE ICCV,2007},
year = {2007},
publisher = {IEEE Computer Society},
address = {Rio de Janeiro, Brazil},
}
@inproceedings{diva2:265758,
author = {Forss\'{e}n, Per-Erik and Lowe, David G.},
title = {{Maximally Stable Colour Regions for Recognition and Matching}},
booktitle = {IEEE CVPR,2007},
year = {2007},
publisher = {IEEE Computer Society},
address = {Minneapolis, USA},
}
@inproceedings{diva2:265780,
author = {Nordberg, Klas},
title = {{Point matching constraints in two and three views}},
booktitle = {Symposium of the German Association for Pattern Recognition (DAGM)},
year = {2007},
series = {LNCS},
volume = {4713},
publisher = {Springer},
address = {Berlin / Heidelberg},
}
A novel and computationally simple method is presented for triangulation of 3D points corresponding to the image coordinates in a pair of stereo images. The image points are described in terms of homogeneous coordinates which are jointly represented as the outer product of these homogeneous coordinates. This paper derives a linear transformation which maps the joint representation directly to the homogeneous representation of the corresponding 3D point in the scene. Compared to other triangulation methods this approach gives similar reconstruction error but is numerically faster, since it only requires linear operations. The proposed method is projective invariant in the same way as the optimal method of Hartley and Sturm. The method has a "blind plane"; a plane through the camera focal points which cannot be reconstructed by this method. For "forward-looking" camera configurations, however, the blind plane can be placed outside the visible scene and does not constitute a problem.
@inproceedings{diva2:265778,
author = {Nordberg, Klas},
title = {{A linear mapping for stereo triangulation}},
booktitle = {Scandiavian Conference on Image Analysis (SCIA)},
year = {2007},
series = {LNCS},
volume = {4522},
publisher = {Springer},
address = {Berlin / Heidelberg},
}
A single-view matching constraint is described which represents a necessary condition which 6 points in an image must satisfy if they are the images of 6 known 3D points under an arbitrary projective transformation. Similar to the well-known matching constraints for two or more views, represented by fundamental matrices or trifocal tensors, single-view matching constraints are represented by tensors, and when multiplied with the homogeneous image coordinates the result vanishes when the condition is satisfied. More precisely, they are represented by 6-th order tensors on ℝ³ which can be computed in a simple manner from the camera projection matrix and the 6 3D points. The single-view matching constraints can be used for finding correspondences between detected 2D feature points and known 3D points, e.g., on an object, which are observed from arbitrary views. Consequently, this type of constraint can be said to be a representation of 3D shape (in the form of a point set) which is invariant to projective transformations when projected onto a 2D image.
@inproceedings{diva2:265779,
author = {Nordberg, Klas},
title = {{Single-View Matching Constraints}},
booktitle = {Advances in Visual Computing},
year = {2007},
series = {Lecture Notes in Computer Science},
volume = {4842},
pages = {397--406},
publisher = {Springer},
address = {Berlin/Heidelberg},
}
The paper presents a method for projection generation through a 2-D pixel image or a 3-D voxel volume. During the design of the method, we have strived to apply knowledge from signal processing theory. Introductory experiments, where the projection generation method was used in an iterative CT reconstruction loop, indicate that the method is sound. Our hope is that the method could be applied in many different connections, where one task is to compute projections through a 2-D pixel image or a 3-D voxel volume. In the future we plan to do more experiments, both in 2-D and 3-D, which hopefully further demonstrate the usefulness of the method.
@inproceedings{diva2:263517,
author = {Magnusson, Maria},
title = {{Projection generation through voxel volumes considering signal processing theory}},
booktitle = {Fully 3D 2007, Ninth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine,2007},
year = {2007},
}
@inproceedings{diva2:262474,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Accurate Interpolation in Appearance-Based Pose Estimation}},
booktitle = {Svenska Sällskapet för Automatiserad Bildanalys SSBA Symposium,2007},
year = {2007},
pages = {13--16},
}
The motion field from image sequences of a dynamic 3D scene is in general piecewise continuous. Since two neighbouring regions may have completely different motions, motion estimation at the discontinuities is problematic. In particular spatial averaging of motion vectors is inappropriate at such positions. We avoid this problem by channel encoding brightness change constraint equations (BCCE) for each spatial position into a channel matrix. By spatial averaging of this channel representation and subsequently decoding we are able to estimate all significantly different motions occurring at the discontinuity, as well as their covariances. This paper extends and improves this multiple motion estimation scheme by locally selecting the appropriate scale for the spatial averaging.
@inproceedings{diva2:262437,
author = {Forss\'{e}n, Per-Erik and Spies, Hagen},
title = {{Multiple Motion Estimation using Channel Matrices}},
booktitle = {International Workshop on Complex Motion IWCM,2004},
year = {2007},
pages = {54--},
publisher = {Springer},
series = {LNCS},
volume = {3417},
}
@inproceedings{diva2:262431,
author = {Felsberg, Michael},
title = {{Extending Graph-Cut to Continuous Value Domain Minimization}},
booktitle = {SSBA,2007},
year = {2007},
}
@inproceedings{diva2:261669,
author = {Larsson, Fredrik and Jonsson, Erik and Felsberg, Michael},
title = {{Visual Servoing Based on Learned Inverse Kinematics}},
booktitle = {SSBA,2007},
year = {2007},
pages = {21--24},
}
One major goal of the COSPAL project is to develop an artificial cognitive system architecture with the capability of exploratory learning. Exploratory learning is a strategy that allows generalization to be applied on a conceptual level, resulting in an extension of competences. Whereas classical learning methods aim at best possible generalization, i.e., concluding from a number of samples of a problem class to the problem class itself, exploration aims at applying acquired competences to a new problem class. Incremental or online learning is an inherent requirement to perform exploratory learning.
Exploratory learning requires new theoretic tools and new algorithms. In the COSPAL project, we mainly investigate reinforcement-type learning methods for exploratory learning and in this paper we focus on its algorithmic aspect. Learning is performed in terms of four nested loops, where the outermost loop reflects the user-reinforcement-feedback loop, the intermediate two loops switch between different solution modes at the symbolic and sub-symbolic levels, respectively, and the innermost loop executes the acquired competences in terms of perception-action cycles. We present a system diagram which explains this process in more detail.
We discuss the learning strategy in terms of learning scenarios provided by the user. This interaction between user ('teacher') and system is a major difference to most existing systems where the system designer places his world model into the system. We believe that this is the key to extendable robust system behavior and successful interaction of humans and artificial cognitive systems.
We furthermore address the issue of bootstrapping the system, and, in particular, the visual recognition module. We give some more in-depth details about our recognition method and how feedback from higher levels is implemented. The described system is however work in progress and no final results are available yet. The preliminary results that we have achieved so far clearly point towards a successful proof of the architecture concept.
@inproceedings{diva2:260360,
author = {Felsberg, Michael and Wiklund, Johan and Jonsson, Erik and Moe, Anders and Granlund, Gösta},
title = {{Exploratory Learning Structure in Artificial Cognitive Systems}},
booktitle = {International Cognitive Vision Workshop},
year = {2007},
publisher = {eCollections},
address = {Bielefeld},
}
@inproceedings{diva2:260357,
author = {Larsson, Fredrik and Jonsson, Erik and Felsberg, Michael},
title = {{Visual Servoing for Floppy Robots using LWPR}},
booktitle = {RoboMat,2007},
year = {2007},
}
This paper explores the possibility of using a single low-resolution FIR camera for detection of pedestrians in the near zone in front of a vehicle. A low resolution sensor reduces the cost of the system, as well as the amount of data that needs to be processed in each frame.
We present a system that makes use of hot-spots and image positions of a near constant bearing to detect potential pedestrians. These detections provide seeds for an energy minimization algorithm that fits a pedestrian model to the detection. Since false alarms are hard to tolerate, the pedestrian model is then tracked, and the distance-to-collision (DTC) is measured by integrating size change measurements at sub-pixel accuracy, and the car velocity. The system should only engage braking for detections on a collision course, with a reliably measured DTC.
Preliminary experiments on a number of recorded near collision sequences indicate that our method may be useful for ranges up to about 10m using an 80x60 sensor, and somewhat more using a 160x120 sensor. We also analyze the robustness of the evaluated algorithm with respect to dead pixels, a potential problem for low-resolution sensors.
@inproceedings{diva2:260359,
author = {Källhammer, Jan-Erik and Eriksson, Dick and Granlund, Gösta and Felsberg, Michael and Moe, Anders and Johansson, Björn and Wiklund, Johan and Forss\'{e}n, Per-Erik},
title = {{Near Zone Pedestrian Detection using a Low-Resolution FIR Sensor}},
booktitle = {Intelligent Vehicles Symposium, 2007 IEEE},
year = {2007},
series = {Intelligent Vehicles Symposium},
publisher = {IEEE},
address = {Istanbul, Turkey},
}
@inproceedings{diva2:259101,
author = {Felsberg, Michael},
title = {{Extending Graph-Cut to Continuous Value Domain Minimization}},
booktitle = {Canadian Conference on Computer and Robot Vision,2007},
year = {2007},
pages = {274--},
publisher = {IEEE},
address = {Los Alamitos, CA, USA},
}
The estimation of a patch position in an image is a long established but still relevant topic with many applications, e.g. pose estimation and tracking in image sequences. In most systems the position estimate needs to be fused with other estimates, and hence, covariance information is required to weight the different estimates in the right way. In this paper we address the issue with covariance estimation in the case of sum of absolute difference (SAD) block matching. First, we derive the theory for covariance estimation in the case of SAD matching. Second, we evaluate the suggested method in a virtual 3D patch tracking scenario in order to verify the performance in real-world scenarios.
@inproceedings{diva2:259100,
author = {Skoglund, Johan and Felsberg, Michael},
title = {{Covariance estimation for SAD block matching}},
booktitle = {Image Analysis},
year = {2007},
series = {Lecture Notes in Computer Science},
volume = {4522},
pages = {374--382},
publisher = {Springer Berlin/Heidelberg},
}
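The following sketch pairs plain SAD block matching with a generic uncertainty estimate obtained by inverting the finite-difference curvature of the cost surface at its minimum. This curvature-based covariance is a common stand-in and is not the dedicated derivation given in the paper; it only illustrates how a position estimate and a weighting covariance can be produced together.

```python
import numpy as np

def sad_match_with_covariance(patch, region, search=8):
    """Match 'patch' inside 'region' (sized patch + 2*search per dimension).

    Returns the integer displacement of the best match and a 2x2 covariance
    proxy from the curvature of the SAD cost surface (sharper minimum ->
    smaller covariance). Assumes the minimum is not on the search border.
    """
    ph, pw = patch.shape
    size = 2 * search + 1
    costs = np.empty((size, size))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            win = region[search + dy: search + dy + ph,
                         search + dx: search + dx + pw]
            costs[dy + search, dx + search] = np.abs(win - patch).sum()
    iy, ix = np.unravel_index(np.argmin(costs), costs.shape)
    iy = int(np.clip(iy, 1, size - 2))     # keep finite differences in range
    ix = int(np.clip(ix, 1, size - 2))
    cyy = costs[iy + 1, ix] - 2 * costs[iy, ix] + costs[iy - 1, ix]
    cxx = costs[iy, ix + 1] - 2 * costs[iy, ix] + costs[iy, ix - 1]
    cxy = (costs[iy + 1, ix + 1] - costs[iy + 1, ix - 1]
           - costs[iy - 1, ix + 1] + costs[iy - 1, ix - 1]) / 4.0
    curvature = np.array([[cyy, cxy], [cxy, cxx]])
    cov = np.linalg.inv(curvature + 1e-9 * np.eye(2))
    return (iy - search, ix - search), cov

# Toy usage: the patch is taken from the region at a known offset of (3, -2).
rng = np.random.default_rng(1)
full = rng.random((48, 48))
patch = full[19:35, 14:30]
region = full[8:40, 8:40]
print(sad_match_with_covariance(patch, region))
```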
One problem in appearance-based pose estimation is the need for many training examples, i.e. images of the object in a large number of known poses. Some invariance can be obtained by considering translations, rotations and scale changes in the image plane, but the remaining degrees of freedom are often handled simply by sampling the pose space densely enough. This work presents a method for accurate interpolation between training views using local linear models. As a view representation local soft orientation histograms are used. The derivative of this representation with respect to the image plane transformations is computed, and a Gauss-Newton optimization is used to optimize all pose parameters simultaneously, resulting in an accurate estimate.
@inproceedings{diva2:259102,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Accurate Interpolation in Appearance-Based Pose Estimation}},
booktitle = {Image Analysis},
year = {2007},
series = {Lecture Notes in Computer Science},
volume = {4522},
pages = {1--10},
publisher = {Springer Berlin/Heidelberg},
}
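The simultaneous optimization of all pose parameters mentioned above is a standard Gauss-Newton scheme. The generic sketch below shows the update rule on a toy curve-fitting problem; in the paper the residual is the difference between the query view representation and the interpolated local linear model of training views, and the Jacobian involves derivatives of the soft orientation histograms, neither of which is reproduced here.

```python
import numpy as np

def gauss_newton(residual, jacobian, p0, iterations=10):
    """Minimize ||residual(p)||^2 with Gauss-Newton updates."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iterations):
        r = residual(p)                          # residual vector at current parameters
        J = jacobian(p)                          # Jacobian of the residual
        p = p - np.linalg.solve(J.T @ J, J.T @ r)
    return p

# Toy usage: fit y = a * exp(b * x) to noisy samples.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.exp(-1.5 * x) + 0.01 * np.random.randn(50)
res = lambda p: p[0] * np.exp(p[1] * x) - y
jac = lambda p: np.stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)], axis=1)
print(gauss_newton(res, jac, [1.0, -1.0]))       # converges towards (2.0, -1.5)
```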
Although not often expressed as a modeling problem, neither projection nor back-projection can be designed without certain insights into the physics of CT. However, most of this insight is left aside, since it is generally believed that only the most simplified models can be included in the innermost time-consuming loop in projection and back-projection. We propose that any linear projection procedure should model three functions: The irradiation function, the footprint/basis function, and the gantry rotation function. We demonstrate how a moderately advanced modeling of these three functions can be brought together in an interpolation procedure and yield a surprisingly efficient inner loop interpolation. To this end we (i) carefully select a locus of the interpolation path through image and projection data spaces and (ii) execute multiple convolution as integration by parts implemented by table look-up.
@inproceedings{diva2:246112,
author = {Danielsson, Per-Erik and Sunnegårdh, Johan},
title = {{Advanced linear modeling and interpolation in CT-reconstruction}},
booktitle = {Proceedings of the Ninth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine, Lindau, Germany, July 9-13, 2007},
year = {2007},
}
In this paper we propose a new approach to real-time view-based object recognition and scene registration. Object recognition is an important sub-task in many applications, such as robotics, retrieval, and surveillance. Scene registration is particularly useful for identifying camera views in databases or video sequences. All of these applications require a fast recognition process and the possibility to extend the database with new material, i.e., to update the recognition system online. The method that we propose is based on P-channels, a special kind of information representation which combines advantages of histograms and local linear models. Our approach is motivated by its similarity to information representation in biological systems but its main advantage is its robustness against common distortions such as clutter and occlusion. The recognition algorithm extracts a number of basic, intensity invariant image features, encodes them into P-channels, and compares the query P-channels to a set of prototype P-channels in a database. The algorithm is applied in a cross-validation experiment on the COIL database, resulting in nearly ideal ROC curves. Furthermore, results from scene registration with a fish-eye camera are presented.
@inproceedings{diva2:241583,
author = {Felsberg, Michael and Hedborg, Johan},
title = {{Real-Time Visual Recognition of Objects and Scenes Using P-Channel Matching}},
booktitle = {Proceedings 15th Scandinavian Conference on Image Analysis},
year = {2007},
series = {Lecture Notes in Computer Science},
volume = {4522},
pages = {908--917},
publisher = {Springer},
address = {Berlin, Heidelberg},
}
The GPU is the main processing unit on a graphics card. A modern GPU typically provides more than ten times the computational power of an ordinary PC processor. This is a result of the high demands for speed and image quality in computer games. This paper investigates the possibility of exploiting this computational power for tracking points in image sequences. Tracking points is used in many computer vision tasks, such as tracking moving objects, structure from motion, face tracking, etc. The algorithm was successfully implemented on the GPU and a large speed-up was achieved.
@inproceedings{diva2:241567,
author = {Hedborg, Johan and Skoglund, Johan and Felsberg, Michael},
title = {{KLT Tracking Implementation on the GPU}},
booktitle = {Proceedings SSBA 2007},
year = {2007},
}
@inproceedings{diva2:265750,
author = {Felsberg, Michael},
title = {{Optical flow estimation from monogenic phase.}},
booktitle = {International Workshop on Complex Motion,2004},
year = {2006},
publisher = {Springer},
}
@inproceedings{diva2:265748,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Soft Histograms for Belief Propagation}},
booktitle = {ECCV Workhop of the Representation and Use of Prior Knowledge in Vision,2006},
year = {2006},
}
@inproceedings{diva2:265747,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Correspondence-Free Associative Learning}},
booktitle = {ICPR,2006},
year = {2006},
}
@inproceedings{diva2:265746,
author = {Felsberg, Michael and Granlund, Gösta},
title = {{P-Channels:
Robust Multivariate M-Estimation of Large Datasets}},
booktitle = {ICPR,2006},
year = {2006},
}
@inproceedings{diva2:265745,
author = {Skoglund, Johan and Felsberg, Michael},
title = {{Evaluation of Subpixel Tracking Algorithms}},
booktitle = {International Symposium on Visual Computing,2006},
year = {2006},
pages = {375--},
}
In order to insert a virtual object into a TV image, the graphics system needs to know precisely how the camera is moving, so that the virtual object can be rendered in the correct place in every frame. Nowadays this can be achieved relatively easily in postproduction, or in a studio equipped with a special tracking system. However, for live shooting on location, or in a studio that is not specially equipped, installing such a system can be difficult or uneconomic. To overcome these limitations, the MATRIS project is developing a real-time system for measuring the movement of a camera. The system uses image analysis to track naturally occurring features in the scene, and data from an inertial sensor. No additional sensors, special markers, or camera mounts are required. This paper gives an overview of the system and presents some results.
@inproceedings{diva2:259695,
author = {Chandaria, Jigna and Thomas, Graham and Bartczak, Bogumil and Koeser, Kevin and Koch, Reinhard and Becker, Mario and Bleser, Gabriele and Stricker, Didier and Wohlleber, Cedric and Felsberg, Michael and Gustafsson, Fredrik and Hol, Jeroen and Schön, Thomas and Skoglund, Johan and Slycke, Per and Smeitz, Sebastiaan},
title = {{Real-Time Camera Tracking in the MATRIS Project}},
booktitle = {Proceedings of the 2006 International Broadcasting Convention},
year = {2006},
}
In this paper we make use of the idea that a robot can autonomously discover objects and learn their appearances by poking and prodding at interesting parts of a scene. In order to make the resultant object recognition ability more robust, and discriminative, we replace earlier used colour histogram features with an invariant texture-patch method. The texture patches are extracted in a similarity invariant frame which is constructed from short colour contour segments. We demonstrate the robustness of our invariant frames with a repeatability test under general homography transformations of a planar scene. Through the repeatability test, we find that defining the frame using ellipse segments instead of lines, where this is appropriate, improves repeatability. We also apply the developed features to autonomous learning of object appearances, and show how the learned objects can be recognised under out-of-plane rotation and scale changes.
@inproceedings{diva2:258028,
author = {Forss\'{e}n, Per-Erik and Moe, Anders},
title = {{Autonomous Learning of Object Appearances using Colour Contour Frames}},
booktitle = {3rd Canadian Conference on Computer and Robot Vision, CRV06, Qu\'{e}bec City, Qu\'{e}bec, Canada},
year = {2006},
pages = {3--3},
publisher = {IEEE Computer Society},
address = {Québec, Canada},
}
We have developed a system which integrates the information output from several pose estimation algorithms and from several views of the scene. It is tested in a real setup with a robotic manipulator. It is shown that integrating pose estimates from several algorithms increases the overall performance of the pose estimation accuracy as well as the robustness as compared to using only a single algorithm. It is shown that increased robustness can be achieved by using pose estimation algorithms based on complementary features, so called algorithmic multi-cue integration (AMC). Furthermore it is also shown that increased accuracy can be achieved by integrating pose estimation results from different views of the scene, so-called temporal multi-cue integration (TMC). Temporal multi-cue integration is the most interesting aspect of this paper.
@inproceedings{diva2:258029,
author = {Viksten, Fredrik and Söderberg, Robert and Nordberg, Klas and Perwass, Christian},
title = {{Increasing Pose Estimation Performance using Multi-cue Integration}},
booktitle = {IEEE International Conference on Robotic and Automation (ICRA)},
year = {2006},
series = {Robotics and Automation},
pages = {3760--3767},
publisher = {IEEE},
}
@inproceedings{diva2:257166,
author = {Forss\'{e}n, Per-Erik and Johansson, Björn and Granlund, Gösta},
title = {{Channel Associative Networks for Multiple Valued Mappings}},
booktitle = {2nd International Cognitive Vision Workshop},
year = {2006},
pages = {4--11},
}
@inproceedings{diva2:245985,
author = {Johansson, Björn and Wiklund, Johan and Granlund, Gösta},
title = {{Goals and status within the IVSS project}},
booktitle = {Seminar on "Cognitive vision in traffic analyses"},
year = {2006},
}
Traditionally, quadrature filters and derivatives have been considered as alternative approaches to low-level image analysis. In this paper we show that there actually exist close connections: We define the quadrature-based boundary tensor and the derivative-based gradient energy tensor which exhibit very similar behavior. We analyse the reason for this and determine how to minimize the difference. These insights lead to a simple and very efficient integrated feature detection algorithm.
@inproceedings{diva2:269099,
author = {Köthe, Ullrich and Felsberg, Michael},
title = {{Riesz-transforms versus derivatives:
On the relationship between the boundary tensor and the energy tensor}},
booktitle = {Scale Space and PDE Methods in Computer Vision},
year = {2005},
series = {Lecture Notes in Computer Science},
volume = {3459},
pages = {179--191},
}
In this paper we propose a new operator which combines advantages of monogenic scale-space and Gaussian scale-space, of the monogenic signal and the structure tensor. The gradient energy tensor (GET) defined in this paper is based on Gaussian derivatives up to third order using different scales. These filters are commonly available, separable, and have an optimal uncertainty. The response of this new operator can be used like the monogenic signal to estimate the local amplitude, the local phase, and the local orientation of an image, but it also allows measuring the coherence of image regions as in the case of the structure tensor. Both theoretically and in experiments the new approach compares favourably with existing methods.
@inproceedings{diva2:269100,
author = {Felsberg, Michael and Köthe, Ullrich},
title = {{GET:
The connection between monogenic scale-space and Gaussian derivatives}},
booktitle = {Scale Space and PDE Methods in Computer Vision},
year = {2005},
series = {Lecture Notes in Computer Science},
volume = {3459},
pages = {192--203},
}
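The GET operator above is assembled from Gaussian derivative responses up to third order at different scales. The sketch below only shows how such responses can be obtained with separable filters (here via scipy.ndimage); combining them into the tensor itself follows the formulas in the paper and is not reproduced.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_derivatives(image, sigma, max_order=3):
    """Return Gaussian derivative responses with total order <= max_order.

    Keys give the per-axis derivative orders, e.g. '10' is the first
    derivative along y (axis 0) and zeroth along x (axis 1).
    """
    responses = {}
    for oy in range(max_order + 1):
        for ox in range(max_order + 1 - oy):
            responses[f"{oy}{ox}"] = gaussian_filter(image, sigma, order=(oy, ox))
    return responses

img = np.random.rand(128, 128)
derivs_fine = gaussian_derivatives(img, sigma=1.0)     # finer scale
derivs_coarse = gaussian_derivatives(img, sigma=2.0)   # coarser scale
```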
The channel representation allows the construction of soft histograms, where peaks can be detected with a much higher accuracy than in regular hard-binned histograms. This is critical in e.g. reducing the number of bins of generalized Hough transform methods. When applying the maximum entropy method to the channel representation, a minimum-information reconstruction of the underlying continuous probability distribution is obtained. The maximum entropy reconstruction is compared to simpler linear methods in some simulated situations. Experimental results show that mode estimation of the maximum entropy reconstruction outperforms the linear methods in terms of quantization error and discrimination threshold. Finding the maximum entropy reconstruction is however computationally more expensive.
@inproceedings{diva2:269086,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Reconstruction of probability density functions from channel representations}},
booktitle = {Scandinavian Conference on Image Analysis},
year = {2005},
series = {Lecture Notes in Computer Science},
volume = {3540},
pages = {491--500},
}
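A minimal illustration of the soft histograms discussed above: scalar samples are encoded with overlapping channel basis functions and accumulated, after which a peak can be located with sub-bin accuracy. The cos^2 kernel is one standard channel basis, and the quadratic peak interpolation below is a simple linear-style decoding; the maximum-entropy reconstruction studied in the paper is more involved and is not shown.

```python
import numpy as np

def channel_encode(samples, n_channels, lo, hi):
    """Encode scalar samples into a soft histogram with overlapping cos^2 channels."""
    centers = np.linspace(lo, hi, n_channels)
    spacing = centers[1] - centers[0]
    d = (samples[:, None] - centers[None, :]) / spacing    # distance in channel units
    resp = np.where(np.abs(d) < 1.5, np.cos(np.pi * d / 3.0) ** 2, 0.0)
    return resp.sum(axis=0), centers

def peak_estimate(hist, centers):
    """Sub-bin peak location via quadratic interpolation around the maximum bin."""
    k = int(np.clip(np.argmax(hist), 1, len(hist) - 2))
    den = hist[k - 1] - 2 * hist[k] + hist[k + 1]
    offset = 0.5 * (hist[k - 1] - hist[k + 1]) / den if den != 0 else 0.0
    return centers[k] + offset * (centers[1] - centers[0])

samples = np.concatenate([np.random.normal(2.0, 0.1, 500),    # dominant mode near 2.0
                          np.random.uniform(0.0, 10.0, 100)]) # uniform outliers
hist, centers = channel_encode(samples, n_channels=16, lo=0.0, hi=10.0)
print(peak_estimate(hist, centers))
```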
This paper describes a novel compact representation of local features called the tensor doublet. The representation generates a four dimensional feature vector which is significantly less complex than other approaches, such as Lowe's 128 dimensional feature vector. Despite its low dimensionality, we demonstrate here that the tensor doublet can be used for pose estimation, where the system is trained for an object and evaluated on images with cluttered background and occlusion.
@inproceedings{diva2:269094,
author = {Söderberg, Robert and Nordberg, Klas and Granlund, Gösta},
title = {{An Invariant and Compact Representation for Unrestricted Pose Estimation}},
booktitle = {Second Iberian Conference Pattern Recognition and Image Analysis (IbPRIA)},
year = {2005},
series = {LNCS},
volume = {3522},
publisher = {Springer},
address = {Berlin / Heidelberg},
}
In this paper, we combine the well-established technique of Wiener filtering with an efficient method for robust smoothing: channel smoothing. The main parameters to choose in channel smoothing are the number of channels and the averaging filter. Whereas the number of channels has a natural lower bound given by the noise level and should for the sake of speed be as small as possible, the averaging filter is a less obvious choice. Based on the linear behavior of channel smoothing for inlier noise, we derive a Wiener filter applicable for averaging the channels of an image. We show in some experiments that our method compares favorable with established methods.
@inproceedings{diva2:269038,
author = {Felsberg, Michael},
title = {{Wiener channel smoothing:
Robust Wiener filtering of images}},
booktitle = {Pattern Recognition},
year = {2005},
series = {Lecture Notes in Computer Science},
volume = {3663},
pages = {468--475},
}
In this paper we briefly review a not so well known quadratic, phase invariant image processing operator, the energy operator, and describe its tensor-valued generalization, the energy tensor. We present relations to the real-valued and the complex valued energy operators and discuss properties of the three operators. We then focus on the discrete implementation for estimating the tensor based on Teager’s algorithm and frame theory. The kernels of the real-valued and the tensor-valued operators are formally derived. In a simple experiment we compare the energy tensor to other operators for orientation estimation. The paper is concluded with a short outlook to future work.
@inproceedings{diva2:269039,
author = {Felsberg, Michael and Jonsson, Erik},
title = {{Energy Tensors:
Quadratic, Phase Invariant Image Operators}},
booktitle = {Pattern Recognition},
year = {2005},
series = {Lecture Notes in Computer Science},
volume = {3663},
pages = {493--500},
publisher = {Springer Berlin/Heidelberg},
}
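For reference, the energy operator reviewed above is easiest to see in its simplest discrete 1D (Teager) form, psi[x](n) = x(n)^2 - x(n-1)x(n+1): for a sinusoid the response is approximately the squared amplitude times the squared frequency, i.e. independent of the phase, which is the property the tensor-valued generalization carries over to images.

```python
import numpy as np

def teager_energy(x):
    """Discrete 1D Teager energy operator: x(n)^2 - x(n-1)*x(n+1)."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

n = np.arange(200)
signal = 3.0 * np.cos(0.2 * n + 1.0)      # amplitude 3, frequency 0.2 rad/sample
print(teager_energy(signal)[50:55])        # roughly constant, about 9 * sin(0.2)^2
```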
Contemporary reconstruction for helical cone-beam CT is mostly based on non-exact algorithms, which produce more or less unacceptable artifacts for cone angles above a certain limit. We report on attempts to extend the applicability of these algorithms to higher cone angles by suppressing artifacts by means of iterative post-processing. The iterative loop includes a ramp-filtering step before back-projection, which promotes fast convergence. The scheme has been applied to the original PI-method as well as to Siemens' AMPR and WFBP methods. Using ordered subsets in the iterative loop for WFBP, we achieved almost spotless images in one single iteration for cone angles of ±9 degrees.
@inproceedings{diva2:263520,
author = {Sunnegårdh, Johan and Danielsson, Per-Erik and Magnusson, Maria},
title = {{Iterative Improvement of Non-Exact Reconstruction in Cone-Beam CT}},
booktitle = {Fully 3D 2005, Eighth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine,2005},
year = {2005},
}
This largely tutorial treatise presents a Fourier based model for 2D-projection, the latter being a most important ingredient in any iterative reconstruction method. For sampled images the model requires an assumed basis function, which implicitly defines the necessary window and interpolation functions. We unravel the basis and window functions for some projection techniques described as procedures. Circular symmetric basis functions make it simple to find interpolation coefficients but require well tuned interpolation functions to avoid aliasing. We find it unnecessary to distinguish between voxel and ray driven projection. These two techniques concern only the innermost loop and both can be applied to any interpolation function, and to projection and back-projection alike.
@inproceedings{diva2:263519,
author = {Danielsson, Per-Erik and Magnusson, Maria and Sunnegårdh, Johan},
title = {{Basis and window functions in CT}},
booktitle = {Fully 3D 2005, Eighth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine,2005},
year = {2005},
}
Contemporary analytical reconstruction methods for helical cone-beam CT have to be designed to handle the Long Object Problem. Normally, a moderate amount of over-scanning is sufficient for reconstruction of a certain Region-of-interest (ROI). Unfortunately, for iterative methods, it seems that the useful ROI will diminish for every iteration step. The remedies proposed here are twofold. Firstly, we use careful extrapolation and masking of projection data. Secondly, we generate and utilize projection data from incompletely reconstructed volume parts, which is rather counter-intuitive and contradictory to our initial assumptions. The results seem very encouraging. Even voxels close to the boundary in the original ROI are as well enhanced by the iterative loop as the middle part.
@inproceedings{diva2:263518,
author = {Magnusson, Maria and Danielsson, Per-Erik and Sunnegårdh, Johan},
title = {{Handling of Long Objects in Iterative Reconstruction from Helical Cone-Beam Projections}},
booktitle = {Fully 3D 2005, Eighth International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine,2005},
year = {2005},
}
In this paper we discuss the benefits of writing code for a specific processor and exploiting all its capabilities. We show that in some situations it is possible to significantly reduce the time consumption by using SSE2, a Single Instruction Multiple Data (SIMD) extension available in new Pentium processors. The speed of the Harris operator is used for evaluation. All experiments are run on a Pentium 4 and the results are compared between ordinary C-code and code using SSE2. The purpose is not only to achieve a significant speed-up of the code, but also to benefit from SSE2 code with the least possible programming effort.
@inproceedings{diva2:258134,
author = {Skoglund, Johan and Felsberg, Michael},
title = {{Fast Image Processing Using SSE2}},
booktitle = {Fast Image Processing Using SSE2,2005},
year = {2005},
}
A robust mean value is often a good alternative to the standard mean value when dealing with data containing many outliers. An efficient method for samples of one-dimensional features and the truncated quadratic error norm is presented and compared to the method of channel averaging (soft histograms).
@inproceedings{diva2:258133,
author = {Jonsson, Erik and Felsberg, Michael},
title = {{Efficient Robust Mean Value Computation of 1D Features}},
booktitle = {Efficient Robust Mean Value Computation of 1D Features,2005},
year = {2005},
}
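The objective behind the robust mean above can be illustrated with a simple fixed-point iteration: under the truncated quadratic error norm, samples farther than a threshold from the current estimate get zero influence, and the estimate is recomputed from the remaining inliers. The paper presents a more efficient scheme for 1D features and compares it with channel averaging; this sketch only shows the objective being minimized.

```python
import numpy as np

def truncated_quadratic_mean(samples, threshold, iterations=20):
    """Fixed-point iteration for the mean under a truncated quadratic norm."""
    samples = np.asarray(samples, dtype=float)
    estimate = np.median(samples)                         # robust starting point
    for _ in range(iterations):
        inliers = samples[np.abs(samples - estimate) < threshold]
        if inliers.size == 0:
            break
        new_estimate = inliers.mean()
        if np.isclose(new_estimate, estimate):
            break
        estimate = new_estimate
    return estimate

data = np.concatenate([np.random.normal(5.0, 0.2, 200),   # inliers around 5.0
                       np.random.uniform(0.0, 50.0, 80)]) # many outliers
print(truncated_quadratic_mean(data, threshold=1.0))
```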
@inproceedings{diva2:258130,
author = {Moe, Anders},
title = {{Local Single-Patch Features for Pose Estimation Using the Log-Polar Transform}},
booktitle = {Local Single-Patch Features for Pose Estimation Using the Log-Polar Transform,2005},
year = {2005},
}
The channel representation allows the construction of soft histograms, where peaks can be detected with a much higher accuracy than in regular hard-binned histograms. This is critical in e.g. reducing the number of bins of generalized Hough transform methods. When applying the maximum entropy method to the channel representation, a minimum-information reconstruction of the underlying continuous probability distribution is obtained. The maximum entropy reconstruction is compared to simpler linear methods in some simulated situations. Experimental results show that mode estimation of the maximum entropy reconstruction outperforms the linear methods in terms of quantization error and discrimination threshold. Finding the maximum entropy reconstruction is however computationally more expensive.
@inproceedings{diva2:258129,
author = {Felsberg, Michael and Jonsson, Erik},
title = {{Reconstruction of Probability Density Functions from Channel Representations}},
booktitle = {Reconstruction of Probability Density Functions from Channel Representations,2005},
year = {2005},
}
To program a robot to solve a simple shape-sorter puzzle is trivial. To devise a Cognitive System Architecture, which allows the system to find out by itself how to go about a solution, is less than trivial. The development of such an architecture is one of the aims of the COSPAL project, leading to new techniques in vision based Artificial Cognitive Systems, which allow the development of robust systems for real dynamic environments. The systems developed under the project itself remain however in simplified scenarios, like the shape-sorter problem described in the present paper. The key property of the described system is its robustness. Since we apply association strategies of local features, the system behaves robustly under a wide range of distortions, such as occlusion, colour and intensity changes. The segmentation step which is applied in many systems known from the literature is replaced with local associations and view-based hypothesis validation. The hypotheses used in our system are based on the anticipated state of the visual percepts. This state replaces explicit modeling of shapes. The current state is chosen by a voting system and verified against the true visual percepts. The anticipated state is obtained from the association to the manipulator actions, where reinforcement learning replaces the explicit calculation of actions. These three differences to classical schemes allow the design of a much more generic and flexible system with a high level of robustness. On the technical side, the channel representation of information and associative learning in terms of the channel learning architecture are essential ingredients for the system. It is the properties of locality, smoothness, and non-negativity which make these techniques suitable for this kind of application. The paper gives brief descriptions of how different system parts have been implemented and shows some examples from our tests.
@inproceedings{diva2:258036,
author = {Felsberg, Michael and Forss\'{e}n, Per-Erik and Moe, Anders and Granlund, Gösta},
title = {{A COSPAL Subsystem:
Solving a Shape-Sorter Puzzle}},
booktitle = {AAAI Fall Symposium: From Reactive to Anticipatory Cognitive Embedded Systems, FS-05-05},
year = {2005},
pages = {65--69},
publisher = {AAAI Press},
}
@inproceedings{diva2:257167,
author = {Johansson, Björn and Moe, Anders},
title = {{Patch-Duplets for Object Recognition and Pose Estimation}},
booktitle = {2nd Canadian Conference on Computer and Robot Vision,2005},
year = {2005},
}
This paper describes a method for extracting point features from an image, corresponding to corners and crossings of lines. The method is based on a local estimation of a 6 x 6 tensor which describes the parameters of a pair of line segments. By considering the rank of the tensor, it is possible to find points of interest. These points can then be further analyzed to provide detailed information about the configuration of the segments. The proposed method is intended for features which can be used for estimation of position and pose of 3D objects, e.g., for the purpose of grasping.
@inproceedings{diva2:250334,
author = {Nordberg, Klas and Söderberg, Robert},
title = {{Detection and representation of complex local features}},
booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2005},
pages = {257--260},
}
In this paper we address one of the standard problems of image processing and computer vision: The detection of points of interest (POI). We propose two new approaches for improving the detection results. First, we define an energy tensor which can be considered as a phase invariant extension of the structure tensor. Second, we use the channel representation for robustly clustering the POI information from the first step resulting in sub-pixel accuracy for the localisation of POI. We compare our method to several related approaches on a theoretical level and show a brief experimental comparison to the Harris detector.
@inproceedings{diva2:269177,
author = {Felsberg, Michael and Granlund, Gösta},
title = {{POI detection using channel clustering and the 2D energy tensor}},
booktitle = {Proceedings of Pattern Recognition, 26th DAGM Symposium},
year = {2004},
series = {Lecture Notes in Computer Science},
volume = {3175},
pages = {103--110},
publisher = {SpringerLink},
}
@inproceedings{diva2:262435,
author = {Krüger, Norbert and Felsberg, Michael and Wörgötter, Florentin},
title = {{Processing Multi-modal Primitives from Image Sequences}},
booktitle = {EIS2004,2004},
year = {2004},
}
@inproceedings{diva2:262432,
author = {Kalkan, Sinan and Calow, D. and Felsberg, Michael and Wörgötter, Florentin and Lappe, M. and Krüger, Norbert},
title = {{Optic Flow Statistics and Intrinsic Dimensionality}},
booktitle = {BICS2004,2004},
year = {2004},
}
Epipolar geometry describes the geometric relationship between two cameras depicting the same scene. For un-calibrated cameras epipolar geometry is compactly described by the fundamental matrix. Estimation of the fundamental matrix is trivial if we have a set of corresponding points in the two images. Corresponding points are often found using e.g. the Harris interest point detector, but there are several advantages to using richer features instead. In this paper we will use blob features. Blobs are homogeneous regions which are compactly described by their colour, area, centroid and inertia matrix. Using blobs to establish correspondences is fast, and the extra information besides position allows us to reject false matches more accurately.
@inproceedings{diva2:258143,
author = {Forss\'{e}n, Per-Erik and Moe, Anders},
title = {{Blobs in Epipolar Geometry}},
booktitle = {Blobs in Epipolar Geometry,2004},
year = {2004},
pages = {82--85},
}
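Once blob correspondences of the kind described above are available, their centroids can be fed into a standard fundamental-matrix estimator. The sketch below uses OpenCV's RANSAC-based estimator on hypothetical centroid arrays; extracting and matching the blobs themselves (colour, area, centroid, inertia matrix) is the contribution of the paper and is not shown.

```python
import numpy as np
import cv2

# Hypothetical matched blob centroids in the two images (N x 2, in pixels).
rng = np.random.default_rng(0)
pts1 = (rng.random((30, 2)) * 640.0).astype(np.float32)
pts2 = pts1 + np.float32([5.0, 2.0]) + rng.normal(0, 0.5, (30, 2)).astype(np.float32)

# Positional arguments after the points: method, RANSAC reprojection
# threshold (pixels), confidence.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
print("Fundamental matrix:\n", F)
print("Inliers:", int(inlier_mask.sum()), "of", len(pts1))
```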
The paper describes a method for extracting point features from an image, corresponding to corners and crossings of lines. The method is based on a fourth order tensor representation which can describe the parameters of a local pair of line segments. By considering the rank of the tensor, it is possible to find points which correspond to corners, crossings or junctions. These points can then be further analyzed to provide detailed information about the configuration of the segments. The proposed method is intended for features which can be used for estimation of position and pose of 3D objects, e.g., for the purpose of grasping.
@inproceedings{diva2:258144,
author = {Nordberg, Klas and Söderberg, Robert},
title = {{Detection and estimation of features for estimation of position}},
booktitle = {Swedish Symposium on Image Analysis (SSBA)},
year = {2004},
pages = {74--77},
}
A novel method for estimating a second order scene tensor is described and results using that method on a synthetic image sequence are shown. It is shown that the tensors can be used to represent basic geometrical entities. A short discussion on what work needs to be done to extend the tensorial description herein into a framework for pose estimation is found at the end of the report.
@inproceedings{diva2:258145,
author = {Nordberg, Klas and Viksten, Fredrik},
title = {{Estimation of a tensor based representation for geometrical 3D primitives based on motion stereo}},
booktitle = {Swedish Symposium on Image Analysis (SSBA)},
year = {2004},
pages = {13--16},
}
This paper presents a novel representation for 3D shapes in terms of planar surface patches and their boundaries. The representation is based on a tensor formalism similar to the usual orientation tensor but extends this concept by using projective spaces and a fourth order tensor, even though the practical computations can be made in normal matrix algebra. This paper also discusses the possibility of estimating the proposed representation from motion fields which are generated by a calibrated camera moving in the scene. One method based on 3D spatio-temporal orientation tensors is presented and results from this method are included.
@inproceedings{diva2:258137,
author = {Nordberg, Klas and Viksten, Fredrik},
title = {{Motion based estimation and representation of 3D surfaces and boundaries}},
booktitle = {International Workshop on Complex Motion (IWCM)},
year = {2004},
series = {LNCS},
volume = {3417},
publisher = {Springer},
address = {Berlin / Heidelberg},
}
@inproceedings{diva2:257168,
author = {Johansson, Björn and Moe, Anders},
title = {{Patch-Duplets for Object Recognition and Pose Estimation}},
booktitle = {Proceedings SSBA04 Symposium on Image Analysis,2004},
year = {2004},
pages = {78--81},
}
This paper presents the new project Efficient Convolution Operators for Image Processing of Volumes and Volume Sequences. The project is carried out in collaboration with Context Vision AB.
By using sequential filtering on 3D and 4D data, the number of nonzero filter coefficients for a desired filter set can be significantly reduced. A sequential convolution structure in combination with a convolver designed for sparse filters is a powerful tool for filtering of multidimensional signals.
The project mainly concerns the design of filter networks, that approximate a desired filter set, while keeping the computational load as low as possible. This is clearly an optimization problem, but it can be formulated in several different ways due to the complexity.
The project is in an initial state and the paper focuses on experiences from prior work and discusses possible approaches for future progress.
@inproceedings{diva2:242568,
author = {Svensson, Björn and Andersson, Mats and Wiklund, Johan and Knutsson, Hans},
title = {{Issues on filter networks for efficient convolution}},
booktitle = {Proceedings of the Swedish Symposium on Image Analysis (2004)},
year = {2004},
pages = {94--97},
address = {Uppsala},
}
This paper presents a proposal (not entirely new) for combining analytical and algebraic reconstruction techniques. Such a combination bears the promise of improving the image quality of fast but non-exact reconstruction of the filtered backprojection type. The difference between the present proposal and traditional ART is that we compute a full error image with FBP, applied to projection differences, to update the solution in each iteration step. The main road-block seems to be the same that has been an obstacle for many ART-algorithms in CT applications, namely that the forward projections are subjected to aliasing, which tends to override the intended benefits of the updating loop. We present an analysis of this problem and indicate some possible solutions.
@inproceedings{diva2:273856,
author = {Danielsson, Per-Erik and Magnusson Seger, Maria},
title = {{A Proposal for Combining FBP and ART in CT-reconstruction}},
booktitle = {Proceedings of the Seventh International Meeting on Fully Three-dimensional Image Reconstruction in Radiology and Nuclear Medicine, St Malo, France, June 30 - July 4, 2003},
year = {2003},
}
This paper presents a novel two-frame motion estimation algorithm. The first step is to approximate each neighborhood of both frames by quadratic polynomials, which can be done efficiently using the polynomial expansion transform. From observing how an exact polynomial transforms under translation a method to estimate displacement fields from the polynomial expansion coefficients is derived and after a series of refinements leads to a robust algorithm. Evaluation on the Yosemite sequence shows good results.
@inproceedings{diva2:273847,
author = {Farnebäck, Gunnar},
title = {{Two-Frame Motion Estimation Based on Polynomial Expansion}},
booktitle = {SCIA13},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2749},
pages = {363--370},
}
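The polynomial-expansion motion estimation described above is what OpenCV's calcOpticalFlowFarneback implements, so a dense two-frame flow field can be obtained directly; the frames below are synthetic placeholders with a known horizontal shift.

```python
import numpy as np
import cv2

prev_frame = (np.random.rand(240, 320) * 255).astype(np.uint8)
next_frame = np.roll(prev_frame, shift=3, axis=1)    # simulate a 3-pixel horizontal motion

# Positional arguments after the frames: flow, pyr_scale, levels, winsize,
# iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print("median horizontal displacement:", np.median(flow[..., 0]))
```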
This paper presents experiments on using integer arithmetic with the channel representation. Integer arithmetic reduces memory requirements and makes efficient implementations possible using machine-code vector instructions, integer-only CPUs, or dedicated programmable hardware such as FPGAs. We demonstrate the effects of discretisation on a non-iterative robust estimation technique called channel smoothing, but the results are also valid for other applications.
@inproceedings{diva2:273827,
author = {Forss\'{e}n, Per-Erik},
title = {{Channel Smoothing using Integer Arithmetic}},
booktitle = {Proceedings SSAB03 Symposium on Image Analysis},
year = {2003},
}
In this paper we present a method to implement the monogenic scale space on a bounded domain and show some applications. The monogenic scale space is a vector valued scale space based on the Poisson scale space, which establishes a sophisticated alternative to the Gaussian scale space. The features of the monogenic scale space, including local amplitude, local phase, local orientation, local frequency, and phase congruency, are much easier to interpret in terms of image features evolving through scale than in the Gaussian case. Furthermore, applying results from harmonic analysis, relations between the features are obtained which improve the understanding of image analysis. As applications, we present a very simple but still accurate approach to image reconstruction from local amplitude and local phase and a method for extracting the evolution of lines and edges through scale.
@inproceedings{diva2:269478,
author = {Felsberg, Michael and Duits, R and Florack, L},
title = {{The monogenic scale space on a bounded domain and its applications}},
booktitle = {Scale Space '03, eds. Griffin, L. D. and Lillholm, M.},
year = {2003},
}
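For context, the features listed above derive from the monogenic signal, whose two Riesz-transform components are conveniently computed in the Fourier domain. The sketch below uses a plain periodic FFT; the careful treatment on a bounded domain is exactly the contribution of the paper and is not reproduced here.

```python
import numpy as np

def monogenic_signal(f):
    """Return local amplitude, phase and orientation of an image f."""
    F = np.fft.fft2(f)
    u = np.fft.fftfreq(f.shape[0])[:, None]
    v = np.fft.fftfreq(f.shape[1])[None, :]
    radius = np.sqrt(u ** 2 + v ** 2)
    radius[0, 0] = 1.0                                   # avoid division by zero at DC
    r1 = np.real(np.fft.ifft2(-1j * u / radius * F))     # first Riesz component
    r2 = np.real(np.fft.ifft2(-1j * v / radius * F))     # second Riesz component
    amplitude = np.sqrt(f ** 2 + r1 ** 2 + r2 ** 2)
    phase = np.arctan2(np.sqrt(r1 ** 2 + r2 ** 2), f)
    orientation = np.arctan2(r2, r1)
    return amplitude, phase, orientation

amp, phase, ori = monogenic_signal(np.random.rand(64, 64))
```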
We present a two-dimensional information representation, where small but overlapping Gaussian kernels are used to encode the data in a matrix. Apart from points we apply this to constraints that restrict the solution to a linear subspace. A localised decoding scheme accurately extracts multiple solutions together with an estimate of the covariances. We employ the method in optical flow computations to determine multiple velocities occurring at motion discontinuities.
@inproceedings{diva2:269470,
author = {Spies, Hagen and Forss\'{e}n, Per-Erik},
title = {{Two-dimensional channel representation for multiple velocities}},
booktitle = {Proceedings of the 13th Scandinavian Conference of Image Analysis, SCIA 2003},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2749},
pages = {356--362},
publisher = {SpringerLink},
address = {Berlin, Heidelberg},
}
This paper presents a method for detection of homogeneous regions in grey-scale images, representing them as blobs. In order to be fast, and not to favour one scale over others, the method uses a scale pyramid. In contrast to most multi-scale methods this one is non-linear, since it employs robust estimation rather than averaging to move through scale-space. This has the advantage that adjacent and partially overlapping clusters only affect each other's shape, not each other's values. It even allows blobs within blobs, to provide a pyramid blob structure of the image.
@inproceedings{diva2:269468,
author = {Forss\'{e}n, Per-Erik and Granlund, Gösta},
title = {{Robust multi-scale extraction of blob features}},
booktitle = {Proceedings or the 13th Scandinavian Conference, SCIA 2003},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2749/2003},
pages = {769--769},
publisher = {Springer Berlin/Heidelberg},
address = {Berlin, Heidelberg},
}
@inproceedings{diva2:265769,
author = {Scharr, Hanno and Felsberg, Michael and Forss\'{e}n, Per-Erik},
title = {{Noise Adaptive Channel Smoothing of Low-Dose Images}},
booktitle = {Computer Vision for the Nano-Scale Workshop accompanying CVPR 2003,2003},
year = {2003},
publisher = {IEEE Computer Society},
address = {Madison},
}
@inproceedings{diva2:257169,
author = {Spies, Hagen and Johansson, Björn},
title = {{Directional Channel Representation for Multiple Line-Endings and Intensity Levels}},
booktitle = {Proceedings of IEEE International Conference on Image Processing,2003},
year = {2003},
pages = {265--268},
}
Channel smoothing is an alternative to diffusion filtering for robust estimation of image features. Its main advantages are speed, stability with respect to parameter changes, and a simple implementation. However, channel smoothing becomes unstable in certain situations, typically for elongated, periodic patterns such as fingerprints. As for diffusion filtering, an anisotropic extension is required in these cases. In this paper we introduce a new method for anisotropic channel smoothing which is comparable to coherence enhancing diffusion, but faster and easier to implement. Anisotropic channel smoothing implements an orientation adaptive non-linear filtering scheme as a special case of adaptive channel filtering. The smoothing algorithm is applied to several fingerprint images and the results are compared to those of coherence enhancing diffusion.
@inproceedings{diva2:246056,
author = {Felsberg, Michael and Granlund, Gösta},
title = {{Anisotropic Channel Filtering}},
booktitle = {SCIA},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2749},
pages = {755--762},
}
In this paper we address the problem of appropriately representing the intrinsic dimensionality of image neighborhoods. This dimensionality describes the degrees of freedom of a local image patch and it gives rise to some of the most often applied corner and edge detectors. It is common to categorize the intrinsic dimensionality (iD) into three distinct cases: i0D, i1D, and i2D. Real images however contain combinations of all three dimensionalities, which have to be taken into account by a continuous representation. Based on considerations of the structure tensor, we derive a cone-shaped iD-space which leads to a probabilistic point of view to the estimation of intrinsic dimensionality.
@inproceedings{diva2:246047,
author = {Felsberg, Michael and Kruger, Norbert},
title = {{A Probabilistic Definition of Intrinsic Dimensionality for Images}},
booktitle = {25. DAGM Symposium Mustererkennung, Magdeburg eds Michaelis, B. and Krell, G.},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2781},
pages = {140--147},
}
@inproceedings{diva2:241579,
author = {Kruger, Norbert and Felsberg, Michael},
title = {{A continuous Formulation of intrinsic Dimension}},
booktitle = {British Machine Vision Conference},
year = {2003},
}
The paper makes a short presentation of three existing methods for estimation of orientation tensors: the so-called structure tensor, quadrature filter based techniques, and techniques based on approximating a local polynomial model. All three methods can be used for estimating an orientation tensor which in the 3D case can be used for motion estimation. The methods are based on rather different approaches in terms of the underlying signal models. However, they produce more or less similar results which indicates that there should be a common framework for estimation of the tensors. Such a framework is proposed, in terms of a second order mapping from signal to tensor with additional conditions on the mapping. It is also shown that the three methods in principle fall into this framework.
@inproceedings{diva2:241568,
author = {Nordberg, Klas and Farnebäck, Gunnar},
title = {{A Framework for Estimation of Orientation and Velocity}},
booktitle = {International Conference on Image Processing (ICIP)},
year = {2003},
}
We consider alpha scale spaces, a parameterized class (alpha is an element of (0, 1]) of scale space representations beyond the well-established Gaussian scale space, which are generated by the alpha-th power of the minus Laplace operator on a bounded domain using the Neumann boundary condition. The Neumann boundary condition ensures that there is no grey-value flux through the boundary. Thereby no artificial grey-values from outside the image affect the evolution process, which is the case for the alpha scale spaces on an unbounded domain. Moreover, the connection between the alpha scale spaces, which is not trivial in the unbounded domain case, becomes straightforward: the generator of the Gaussian semigroup extends to a compact, self-adjoint operator on the Hilbert space L-2(Omega) and therefore it has a complete countable set of eigenfunctions. Taking the alpha-th power of the Gaussian generator simply boils down to taking the alpha-th power of the corresponding eigenvalues. Consequently, all alpha scale spaces have exactly the same eigenmodes and can be implemented simultaneously as scale dependent Fourier series. The only difference between them is the (relative) contribution of each eigenmode to the evolution process. By introducing the notion of (non-dimensional) relative scale in each alpha scale space, we are able to compare the various alpha scale spaces. The case alpha = 0.5, where the generator equals the square root of the minus Laplace operator, leads to Poisson scale space, which is at least as interesting as Gaussian scale space and can be extended to a (Clifford) analytic scale space.
@inproceedings{diva2:241564,
author = {Duits, Remco and Felsberg, Michael and Florack, Luc and Platel, Bram},
title = {{$\alpha$ Scale Spaces on a Bounded Domain}},
booktitle = {Scale Space Methods in Computer Vision},
year = {2003},
series = {Lecture Notes in Computer Science},
volume = {2695},
pages = {502--518},
}
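The bounded-domain construction described above lends itself directly to an implementation as a scale-dependent cosine series: with Neumann boundary conditions the eigenmodes are cosines, so each DCT coefficient is damped by exp(-t * lambda^alpha). A minimal sketch under those assumptions, using the discrete Neumann Laplacian eigenvalues (alpha = 1 gives Gaussian, alpha = 0.5 Poisson scale space):

```python
import numpy as np
from scipy.fft import dctn, idctn

def alpha_scale_space(img, t, alpha=0.5):
    """Evolve an image in the alpha scale space on a bounded domain with
    Neumann boundary conditions: damp each cosine eigenmode of the
    (discrete) Laplacian by exp(-t * lambda**alpha)."""
    f = np.asarray(img, dtype=float)
    M, N = f.shape
    k = np.arange(M)[:, None]
    l = np.arange(N)[None, :]
    # eigenvalues of the discrete Neumann Laplacian under the DCT-II
    lam = 4 * np.sin(np.pi * k / (2 * M)) ** 2 + 4 * np.sin(np.pi * l / (2 * N)) ** 2
    coeffs = dctn(f, type=2, norm='ortho')
    coeffs *= np.exp(-t * lam ** alpha)      # alpha-th power of the generator
    return idctn(coeffs, type=2, norm='ortho')
```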
We present a computational framework that extends classical image velocity estimation to include more general parameters of dynamic brightness changes. The introduced method allows for an extraction of these parameters, ranging from models of linear illumination changes over diffusion and decay constants to expansion rates. We illustrate the benefit of such an extension on a real image sequence with illumination changes. We also introduce a new depth estimation technique termed depth from diffusion and apply it to some real examples.
@inproceedings{diva2:302471,
author = {Spies, Hagen and Dierig, Tobias and Garbe, Christoph S.},
title = {{Local Models for Dynamic Processes in Image Sequences}},
booktitle = {Workshop Dynamic Perception},
year = {2002},
pages = {59--64},
address = {Bochum},
}
Disparity estimation is a fundamental problem of computer vision. Besides other approaches, disparity estimation from phase information is a quite widespread technique. In the present paper, we have considered the influence of the involved quadrature filters and we have replaced them with filters based on the monogenic signal. The implemented algorithm makes use of a scale-pyramid and applies channel encoding for the representation and fusion of the estimated data. The performed experiments show a significant improvement of the results.
@inproceedings{diva2:273836,
author = {Felsberg, Michael},
title = {{Disparity from monogenic phase}},
booktitle = {DAGM Symposium Mustererkennung, Zurich},
year = {2002},
series = {Lecture Notes in Computer Science},
volume = {2449},
pages = {248--256},
publisher = {Springer, Heidelberg},
}
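A minimal sketch of computing the monogenic signal and its phase via the Riesz transform in the Fourier domain. It assumes the input has already been bandpass filtered, and it omits the scale pyramid and the channel-based fusion used in the paper; disparity would then be derived from phase differences between the two views, which is not shown.

```python
import numpy as np

def monogenic_phase(img):
    """Monogenic signal of a bandpass-filtered image via the Riesz transform
    in the Fourier domain; returns local amplitude, phase and orientation."""
    f = np.asarray(img, dtype=float)
    F = np.fft.fft2(f)
    u = np.fft.fftfreq(f.shape[0])[:, None]
    v = np.fft.fftfreq(f.shape[1])[None, :]
    rad = np.hypot(u, v)
    rad[0, 0] = 1.0                          # avoid division by zero at DC
    r1 = np.real(np.fft.ifft2(F * (-1j * u / rad)))   # first Riesz component
    r2 = np.real(np.fft.ifft2(F * (-1j * v / rad)))   # second Riesz component
    amp = np.sqrt(f ** 2 + r1 ** 2 + r2 ** 2)
    phase = np.arctan2(np.hypot(r1, r2), f)           # monogenic (local) phase
    orient = np.arctan2(r2, r1)                        # local orientation
    return amp, phase, orient
```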
This paper presents an overview of the basic and applied research carried out by the Computer Vision Laboratory, Linköping University, in the WITAS UAV Project. This work includes customizing and redesigning vision methods to fit the particular needs and restrictions imposed by the UAV platform, e.g., for low-level vision, motion estimation, navigation, and tracking. It also includes a new learning structure for association of perception-action activations, and a runtime system for implementation and execution of vision algorithms. The paper also contains a brief introduction to the WITAS UAV Project.
@inproceedings{diva2:262461,
author = {Nordberg, Klas and Doherty, Patrick and Farnebäck, Gunnar and Forss\'{e}n, Per-Erik and Granlund, Gösta and Moe, Anders and Wiklund, Johan},
title = {{Vision for a UAV helicopter}},
booktitle = {International Conference on Intelligent Robots and Systems (IROS), Workshop on Aerial Robotics},
year = {2002},
}
@inproceedings{diva2:257170,
author = {Johansson, Björn and Farnebäck, Gunnar},
title = {{A Theoretical Comparison of Different Orientation Tensors}},
booktitle = {Proceedings SSAB02 Symposium on Image Analysis, 2002},
year = {2002},
pages = {69--73},
}
Recently, substantial research has been devoted to Unmanned Aerial Vehicles (UAVs). One of a UAV's most demanding subsystems is vision. The vision subsystem must dynamically combine different algorithms as the UAV's goals and surroundings change. To fully utilize the available hardware, a run-time system must be able to vary the quality and the size of the regions the algorithms are applied to, as the number of image processing tasks changes. To allow this, the run-time system and the underlying computational model must be integrated. In this paper we present a computational model suitable for integration with a run-time system. The computational model is called Image Processing Data Flow Graph (IP-DFG). IP-DFG has been developed for modeling of complex image processing algorithms. IP-DFG is based on data flow graphs, but has been extended with hierarchy and new rules for token consumption, which make the computational model more flexible and more suitable for human interaction. In this paper we also show that IP-DFGs are suitable for modelling expressions, including data dependent decisions and iterations, which are common in complex image processing algorithms.
@inproceedings{diva2:246033,
author = {Andersson, Per and Kuchcinski, Krzysztof and Nordberg, Klas and Doherty, Patrick},
title = {{Integrating a computational model and a run time system for image processing on a UAV}},
booktitle = {Euromicro Symposium on Digital System Design (DSD)},
year = {2002},
pages = {102--109},
}
Quadrature filters are a well known method in low-level computer vision for estimating certain properties of the signal, such as local amplitude and local phase. However, 2D quadrature filters suffer from not being rotation invariant. Furthermore, they do not allow the detection of truly 2D features such as corners and junctions unless they are combined to form the structure tensor. The present paper deals with a new 2D generalization of quadrature filters which is rotation invariant and allows the analysis of intrinsically 2D signals. Hence, the new approach can be considered as the union of properties of quadrature filters and of the structure tensor. The proposed method first estimates the local orientation of the signal, which is then used for steering some basis filter responses. Certain linear combinations of these filter responses are derived which allow the estimation of the local isotropy and two perpendicular phases of the signal. The phase model is based on the assumption of an angular band-limitation in the signal. As an application, a simple and efficient point-of-interest operator is presented and compared to the Plessey detector.
@inproceedings{diva2:246027,
author = {Felsberg, Michael and Sommer, Gerald},
title = {{Image Features Based on a New Approach to 2D Rotation Invariant Quadrature Filters}},
booktitle = {Computer Vision - ECCV 2002 eds A. Heyden and G. Sparr and M. Nielsen and P. Johansen},
year = {2002},
series = {Lecture Notes in Computer Science},
volume = {2350},
pages = {369--383},
}
This paper describes how a world model for successive recognition can be learned using associative learning. The learned world model consists of a linear mapping that successively updates a high-dimensional system state using performed actions and observed percepts. The actions of the system are learned by rewarding actions that are good at resolving state ambiguities. As a demonstration, the system is used to resolve the localisation problem in a labyrinth.
@inproceedings{diva2:245959,
author = {Forssen, Per-Erik},
title = {{Successive Recognition using Local State Models}},
booktitle = {Proceedings SSAB02 Symposium on Image Analysis},
year = {2002},
pages = {9--12},
}
We introduce a compact coding of image information which explicitly separates geometric information (orientation) and structural information (phase and color). We investigate the importance of these factors for stereo matching on a large data set. From these investigations we can conclude that it is their combination that gives the best results. Concrete weights for their relative importance are measured.
@inproceedings{diva2:242254,
author = {Kruger, Norbert and Felsberg, Michael and Gebken, Christian and Pörksen, Martin},
title = {{An Explicit and Compact Coding of Geometric and Structural Information Applied to Stereo Processing}},
booktitle = {Vision, Modeling, and Visualization},
year = {2002},
}
A common wish in non-destructive testing is to investigate a large object with a small interesting detail inside. Due to practical circumstances, the projections may sometimes be truncated. According to the theory of tomography, it is then impossible to reconstruct the object. However, sometimes it is possible to obtain an approximate result. It turns out that the key point is how to implement the ramp filter. The quality of the result depends on the object itself. We show one good experiment on real data, linear cone-beam tomography for logs. We also show experiments on the Shepp-Logan phantom, well-known from medical CT, and discuss the varying reconstruction quality.
@inproceedings{diva2:241588,
author = {Magnusson Seger, Maria},
title = {{Rampfilter implementation on truncated projection data. Application to 3D linear tomography for logs.}},
booktitle = {Proceedings SSAB02 Symposium on Image Analysis},
year = {2002},
pages = {33--36},
}
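The key step discussed above, the ramp filter, is easily written down in its textbook frequency-domain form; the paper's point is precisely that the choice of implementation matters for truncated projections, which the naive version below does not address.

```python
import numpy as np

def ramp_filter(projections):
    """Apply the ramp filter |omega| to each projection row in the Fourier
    domain (the standard filtered-backprojection filtering step)."""
    p = np.asarray(projections, dtype=float)
    ramp = np.abs(np.fft.fftfreq(p.shape[-1]))         # |omega|
    return np.real(np.fft.ifft(np.fft.fft(p, axis=-1) * ramp, axis=-1))
```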
One important problem within the WITAS project is detection of moving objects in aerial images. This paper presents an original method to estimate the displacement between two frames, based on multiscale local polynomial expansions of the images. When the displacement field has been computed, a plane + parallax approach is used to separate moving objects from the camera egomotion.
@inproceedings{diva2:241572,
author = {Farnebäck, Gunnar and Nordberg, Klas},
title = {{Motion Detection in the WITAS Project}},
booktitle = {Swedish Symposium on Image Analysis (SSBA)},
year = {2002},
pages = {99--102},
}
This paper advocates the use of overlapping bins in histogram creation. It is shown how conventional histogram creation has an inherent quantisation that causes errors much like those in sampling with insufficient band limitation. The use of overlapping bins is shown to be the deterministic equivalent to dithering. Two applications of soft histograms are shown: improved peak localisation in an estimated probability density function (PDF) without requiring more samples, and accurate estimation of image rotation.
@inproceedings{diva2:273868,
author = {Forssen, Per-Erik},
title = {{Image Analysis using Soft Histograms}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {2001},
pages = {109--112},
}
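A minimal sketch of a soft histogram with overlapping bins, here using linear (triangular) bin profiles so that each sample is shared between its two nearest bin centres; the bin shape and range handling are assumptions of the sketch.

```python
import numpy as np

def soft_histogram(samples, n_bins, lo, hi):
    """Histogram with overlapping (linearly interpolating) bins: each sample
    is shared between its two nearest bin centres with weights summing to one."""
    x = np.asarray(samples, dtype=float)
    centres = np.linspace(lo, hi, n_bins)
    width = centres[1] - centres[0]
    pos = np.clip((x - lo) / width, 0, n_bins - 1)
    left = np.floor(pos).astype(int)
    right = np.minimum(left + 1, n_bins - 1)
    w = pos - left                                    # weight of the right bin
    hist = np.zeros(n_bins)
    np.add.at(hist, left, 1.0 - w)
    np.add.at(hist, right, w)
    return centres, hist
```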
This paper presents a novel disparity estimation algorithm based on local polynomial expansion of the images in a stereo pair. Being a spin-off from work on two-frame motion estimation, it is primarily intended as a proof of concept for some of the underlying ideas. It may, however, be useful on its own as well, since it is very simple and fast. The accuracy still remains to be determined.
@inproceedings{diva2:246104,
author = {Farnebäck, Gunnar},
title = {{Disparity Estimation from Local Polynomial Expansion}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {2001},
pages = {77--80},
}
This paper shows how canonical correlation can be used to learn a detector for corner orientation invariant to corner angle and intensity. Pairs of images with the same corner orientation but different angle and intensity are used as training samples. Three different image representations are examined: intensity values, products between intensity values, and local orientation. The last representation gives a well behaved result that is easy to decode into the corner orientation. To reduce dimensionality, parameters from a polynomial model fitted to the different representations are also considered. This reduction did not affect the performance of the system.
@inproceedings{diva2:246077,
author = {Johansson, Björn and Borga, Magnus and Knutsson, Hans},
title = {{Learning Corner Orientation Using Canonical Correlation}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {2001},
pages = {89--92},
}
In [Farnebäck00] we presented a new velocity estimation algorithm, using orientation tensors and parametric motion models to provide both fast and accurate results. One of the tradeoffs between accuracy and speed was that no attempt was made to obtain regions of coherent motion when estimating the parametric models. In this paper we show how this can be improved by performing a simultaneous segmentation of the motion field. The resulting algorithm is slower than the previous one, but more accurate. This is shown by evaluation on the well-known Yosemite sequence, where the previous algorithm already showed an accuracy substantially better than that of earlier published methods. This result has now been improved further.
@inproceedings{diva2:246017,
author = {Farnebäck, Gunnar},
title = {{Very High Accuracy Velocity Estimation using Orientation Tensors, Parametric Motion, and Simultaneous Segmentation of the Motion Field}},
booktitle = {Proceedings of the Eighth IEEE International Conference on Computer Vision},
year = {2001},
pages = {171--177},
}
Essentially all Computer Vision strategies require initial computation of orientation structure or motion estimation. Although much work has been invested in this subfield, methods have so far been very computationally demanding and/or not very robust. In this paper we present a novel method for computation of orientation tensors for signals of any dimensionality. The method is based on local weighted least squares approximations of the signal by second degree polynomials. It is shown how this can be implemented very efficiently by means of separable convolutions and that the method gives very accurate orientation estimates. We also introduce the new concept of orientation functionals, of which orientation tensors form a subclass. Finally we demonstrate the critical importance of using a proper weighting function in the local projection of the signal onto polynomials.
@inproceedings{diva2:273875,
author = {Farnebäck, Gunnar},
title = {{Orientation Estimation Based on Weighted Projection onto Quadratic Polynomials}},
booktitle = {Vision, Modeling, and Visualization 2000: proceedings},
year = {2000},
pages = {89--96},
}
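The core idea, a weighted least-squares projection of each neighbourhood onto the basis {1, x, y, x², y², xy} followed by forming a tensor from the quadratic and linear coefficients, can be illustrated with a naive per-pixel loop. The separable-convolution implementation that makes the method efficient is not reproduced here, and the window size, Gaussian applicability, and the gamma weighting of the linear part are assumed values for the sketch.

```python
import numpy as np

def orientation_tensor_polyexp(img, window=9, sigma=2.0, gamma=0.1):
    """Orientation tensor from a local weighted fit f(x) ~ x^T A x + b^T x + c,
    with T = A A^T + gamma * b b^T (naive per-pixel loop)."""
    r = window // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    w = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)).ravel()    # applicability
    B = np.stack([np.ones(window * window), xs.ravel(), ys.ravel(),
                  xs.ravel() ** 2, ys.ravel() ** 2, (xs * ys).ravel()], axis=1)
    G = np.linalg.solve(B.T @ (B * w[:, None]), (B * w[:, None]).T)  # samples -> coeffs
    f = np.pad(np.asarray(img, dtype=float), r, mode='reflect')
    H, W = img.shape
    T = np.zeros((H, W, 2, 2))
    for i in range(H):
        for j in range(W):
            c0, bx, by, axx, ayy, axy = G @ f[i:i + window, j:j + window].ravel()
            A = np.array([[axx, axy / 2], [axy / 2, ayy]])
            b = np.array([bx, by])
            T[i, j] = A @ A.T + gamma * np.outer(b, b)
    return T
```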
We present a novel method that finds edges between certain image features, e.g. gray-levels, and disregards edges between other features. The method uses a channel representation of the features and performs normalized convolution using the channel values as certainties. This means that areas with certain features can be disregarded by the edge filter. The method provides an important new tool for finding tissue specific edges in medical images, as demonstrated by an MR-image example.
@inproceedings{diva2:273861,
author = {Borga, Magnus and Malmgren, Helge and Knutsson, Hans},
title = {{FSED - Feature Selective Edge Detection}},
booktitle = {ICPR15},
year = {2000},
pages = {229--232 vol.1},
publisher = {IEEE},
}
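A simplified sketch of the mechanism: normalized averaging (normalized convolution with a constant basis) driven by a per-pixel certainty map, followed by a gradient. The paper uses channel values as certainties; a single certainty map is assumed here for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def feature_selective_edges(img, certainty, sigma=1.5, eps=1e-9):
    """Edges computed only from pixels carrying the selected feature:
    normalized averaging with per-pixel certainties, then a gradient."""
    f = np.asarray(img, dtype=float)
    c = np.asarray(certainty, dtype=float)         # 1 = keep, 0 = disregard
    smoothed = gaussian_filter(f * c, sigma) / (gaussian_filter(c, sigma) + eps)
    gx, gy = sobel(smoothed, axis=1), sobel(smoothed, axis=0)
    return np.hypot(gx, gy) * c                    # suppress edges outside the feature
```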
This article describes an essential step towards what is called a view centered representation of the low-level structure in an image. Instead of representing low-level structure (lines and edges) in one compact feature map, we will separate structural information into several feature maps, each signifying features at a characteristic phase, in a specific scale. By characteristic phase we mean the phases 0, pi, and +/-pi/2, corresponding to bright and dark lines, and edges between different intensity levels, or colours. A lateral inhibition mechanism selects the strongest feature within each local region of each scale represented. The scale representation is limited to maps one octave apart, but can be interpolated to provide a continuous representation. The resultant image representation is sparse, and thus well suited for further processing, such as pattern detection.
@inproceedings{diva2:269833,
author = {Forss\'{e}n, Per-Erik and Granlund, Gösta},
title = {{Sparse feature maps in a scale hierarchy}},
booktitle = {Algebraic Frames for the Perception-Action Cycle, Proceedings Second International Workshop, AFPAC 2000},
year = {2000},
series = {Lecture Notes in Computer Science},
volume = {1888},
pages = {186--196},
publisher = {Springer Berlin/Heidelberg},
address = {Berlin, Heidelberg},
}
The purpose of this paper is to provide a broad overview of the WITAS Unmanned Aerial Vehicle Project. The WITAS UAV project is an ambitious, long-term basic research project with the goal of developing technologies and functionalities necessary for the successful deployment of a fully autonomous UAV operating over diverse geographical terrain containing road and traffic networks. The project is multi-disciplinary in nature, requiring many different research competences, and covering a broad spectrum of basic research issues, many of which relate to current topics in artificial intelligence. A number of topics considered are knowledge representation issues, active vision systems and their integration with deliberative/reactive architectures, helicopter modeling and control, ground operator dialogue systems, actual physical platforms, and a number of simulation techniques.
@inproceedings{diva2:262443,
author = {Doherty, Patrick and Granlund, Gösta and Kuchcinski, Krzysztof and Sandewall, Erik Johan and Nordberg, Klas and Skarman, Erik and Wiklund, Johan},
title = {{The WITAS unmanned aerial vehicle project}},
booktitle = {Proceedings of the 14th European Conference on Artificial Intelligence (ECAI)},
year = {2000},
pages = {747--755},
publisher = {IOS Press},
address = {Amsterdam},
}
Perceptual experiments indicate that corners and curvature are very important features in the process of recognition. This paper presents a new method to detect rotational symmetries, which describe complex curvature such as corner, circle, star, and spiral patterns. It works in two steps: 1) it extracts local orientation from a gray-scale or color image; and 2) it applies normalized convolution on the orientation image with rotational symmetry filters as basis functions. These symmetries can serve as feature points at a high abstraction level for use in hierarchical matching structures for 3D estimation, object recognition, image database retrieval, etc.
@inproceedings{diva2:257171,
author = {Johansson, Björn and Knutsson, Hans and Granlund, Gösta},
title = {{Detecting Rotational Symmetries using Normalized Convolution}},
booktitle = {Proceedings of the 15th International Conference on Pattern Recognition, 2000},
year = {2000},
pages = {496--500 vol.3},
publisher = {IEEE},
}
This paper presents a general strategy for automated generation of efficient representations in vision. The approach is highly task oriented and what constitutes the relevant information is defined by a set of examples. The examples are pairs of situations that are dependent through the chosen feature but are otherwise independent. Particularly important concepts in the work are mutual information and canonical correlation. It is shown how visual operators and representations can be generated from examples for a number of features, e.g. local orientation, disparity and motion. Interesting similarities to biological vision functions are observed. The results clearly demonstrate the potential of combining advanced filtering techniques and learning strategies based on canonical correlation analysis (CCA).
@inproceedings{diva2:250400,
author = {Knutsson, Hans and Andersson, Mats and Borga, Magnus and Wiklund, Johan},
title = {{Automated generation of representations in vision}},
booktitle = {International Conference on Pattern Recognition ICPR, 2000},
year = {2000},
pages = {59--66 vol.3},
publisher = {IEEE},
address = {Barcelona, Spain},
}
The WITAS Unmanned Aerial Vehicle Project is a long term basic research project located at Linköping University (LIU), Sweden. The project is multi-disciplinary in nature and involves cooperation with different departments at LIU, and a number of other universities in Europe, the USA, and South America. In addition to academic cooperation, the project involves collaboration with a number of private companies supplying products and expertise related to simulation tools and models, and the hardware and sensory platforms used for actual flight experimentation with the UAV. Currently, the project is in its second phase with an intended duration from 2000-2003.
This paper will begin with a brief overview of the project, but will focus primarily on the computer vision related issues associated with interpreting the operational environment which consists of traffic and road networks and vehicular patterns associated with these networks.
@inproceedings{diva2:246122,
author = {Granlund, Gösta and Nordberg, Klas and Wiklund, Johan and Doherty, Patrick and Skarman, Erik and Sandewall, Erik},
title = {{WITAS: An Intelligent Autonomous Aircraft Using Active Vision}},
booktitle = {Proceedings of the UAV 2000 International Technical Conference and Exhibition (UAV)},
year = {2000},
publisher = {Euro UVS},
address = {Paris, France},
}
This contest involved the running and evaluation of computer vision and pattern recognition techniques on different data sets with known ground truth. The contest included three areas: binary shape recognition, symbol recognition, and image flow estimation. A package was made available for each area. Each package contained either real images with manual ground truth or programs to generate data sets of ideal as well as noisy images with known ground truth. They also contained programs to evaluate the results of an algorithm according to the given ground truth. These evaluation criteria included the generation of confusion matrices, computation of the misdetection and false alarm rates, and other performance measures suitable for the problems. The paper summarizes the data generation for each area and experimental results for a total of six participating algorithms.
@inproceedings{diva2:246037,
author = {Aksoy, Selim and Ming, Ye and Schauf, Michael L. and Song, Mingzhou and Wang, Yalin and Haralick, Robert M. and Parker, Jim R. and Pivovarov, Juraj and Royko, Dominik and Sun, Changming and Farnebäck, Gunnar},
title = {{Algorithm Performance Contest}},
booktitle = {Proceedings. 15th International Conference on Pattern Recognition, 2000},
year = {2000},
series = {Pattern Recognition},
volume = {4},
pages = {870--876},
publisher = {IEEE},
}
Motion estimation in image sequences is an important step in many computer vision and image processing applications. Several methods for solving this problem have been proposed, but very few manage to achieve a high level of accuracy without sacrificing processing speed. This paper presents a novel motion estimation algorithm, which gives excellent results on both counts. The algorithm starts by computing 3D orientation tensors from the image sequence. These are combined under the constraints of a parametric motion model to produce velocity estimates. Evaluated on the well-known Yosemite sequence, the algorithm shows an accuracy which is substantially better than for previously published methods. Computationally the algorithm is simple and can be implemented by means of separable convolutions, which also makes it fast.
@inproceedings{diva2:241589,
author = {Farnebäck, Gunnar},
title = {{Fast and Accurate Motion Estimation using Orientation Tensors and Parametric Motion Models}},
booktitle = {ICPR15},
year = {2000},
pages = {135--139 vol.1},
publisher = {IEEE},
}
Frame representations (e.g. wavelets) and subspace projections are important tools in many image processing applications. A unified framework for frames and subspace bases, as well as bases and subspace frames, is developed for finite dimensional vector spaces. Dual (subspace) bases and frames are constructed and the theory is generalized to weighted norms and seminorms. It is demonstrated how the framework applies to the cubic facet model, to normalized convolution, and to projection onto second degree polynomials.
@inproceedings{diva2:273877,
author = {Farnebäck, Gunnar},
title = {{A Unified Framework for Bases, Frames, Subspace Bases, and Subspace Frames}},
booktitle = {Proceedings of the 11th Scandinavian Conference on Image Analysis},
year = {1999},
pages = {341--349},
}
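For the finite-dimensional case, the frame/dual-frame construction underlying this framework can be verified in a few lines; the example below uses an unweighted norm and a redundant frame of R^2, whereas the paper also covers subspace frames and weighted norms and seminorms, not shown here.

```python
import numpy as np

# A redundant frame of R^2: three frame vectors as columns.
F = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

S = F @ F.T                        # frame operator
F_dual = np.linalg.solve(S, F)     # dual frame vectors as columns

x = np.array([3.0, -2.0])
coeffs = F.T @ x                   # analysis with the frame
x_rec = F_dual @ coeffs            # synthesis with the dual frame
assert np.allclose(x, x_rec)       # perfect reconstruction
```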
Digital subtraction angiography, whether based on traditional X-ray or MR, suffers from patient motion artifacts. Until now, the usual remedy has been to pixel shift by hand, or in some cases to perform a global pixel shift semi-automatically. This is time consuming, and cannot handle rotations or locally varying deformations over the image. We have developed a fully automatic algorithm that provides for motion compensation in the presence of large local deformations. Our motion compensation is very accurate for ordinary motions, including large rotations and deformations. It does not matter if the motions are irregular over time. For most images, it takes about a second per image to get adequate accuracy. The method is based on using the phase from filter banks of quadrature filters tuned in different directions and frequencies. Unlike traditional methods for optical flow and correlation, our method is more accurate and less susceptible to disturbing changes in the image, e.g. a moving contrast bolus. The implications for common practice are that radiologists' time can be significantly reduced in ordinary peripheral angiographies and that the number of retakes due to large or local motion artifacts will be much reduced.
@inproceedings{diva2:273850,
author = {Hemmendorff, Magnus and Knutsson, Hans and Andersson, Mats T. and Kronander, Torbjörn},
title = {{Motion compensated digital subtraction angiography}},
booktitle = {Proceedings of SPIE's International Symposium on Medical Imaging, vol 3661, 1999},
year = {1999},
}
This paper presents a general strategy for designing efficient visual operators. The approach is highly task oriented and what constitutes the relevant information is defined by a set of examples. The examples are pairs of images displaying a strong dependence in the chosen feature but are otherwise independent. Particularly important concepts in the work are mutual information and canonical correlation. Visual operators learned from examples are presented, e.g. local shift invariant orientation operators and image content invariant disparity operators. Interesting similarities to biological vision functions are observed.
@inproceedings{diva2:246039,
author = {Knutsson, Hans and Borga, Magnus},
title = {{Learning Visual Operators from Examples: A New Paradigm in Image Processing}},
booktitle = {Proceedings of the 10th International Conference on Image Analysis and Processing (ICIAP'99)},
year = {1999},
}
This paper presents a new and efficient approach for optimization and implementation of filter banks, e.g. velocity channels, orientation channels and scale spaces. The multi-layered structure of a filter network enables a powerful decomposition of complex filters into simple filter components, and the intermediary results may contribute to several output nodes. Compared to a direct implementation, a filter network uses only a fraction of the coefficients to provide the same result. The optimization procedure is recursive and all filters on each level are optimized simultaneously. The individual filters of the network, in general, contain very few non-zero coefficients, but there are no restrictions on the spatial position of the coefficients; they may e.g. be concentrated on a line or be sparsely scattered. An efficient implementation of a quadrature filter hierarchy for generic purposes using sparse filter components is presented.
@inproceedings{diva2:242258,
author = {Andersson, Mats and Wiklund, Johan and Knutsson, Hans},
title = {{Filter Networks}},
booktitle = {Proceedings of Signal and Image Processing (SIP'99)},
year = {1999},
pages = {213--217},
publisher = {IASTED},
address = {Nassau, Bahamas},
}
This paper presents a general approach for obtaining optimal filters as well as filter sequences. A filter is termed optimal when it minimizes a chosen distance measure with respect to an ideal filter. The method allows specification of the metric via simultaneous weighting functions in multiple domains, e.g. the spatio-temporal space and the Fourier space. It is shown how convolution kernels for efficient spatio-temporal filtering can be implemented in practical situations. The method is based on applying a set of jointly optimized filter kernels in sequence. The optimization of sequential filters is performed using a novel recursive optimization technique. A number of optimization examples are given that demonstrate the role of key parameters such as: number of kernel coefficients, number of filters in sequence, spatio-temporal and Fourier space metrics. In multidimensional filtering applications the method potentially outperforms both standard convolution and FFT based approaches by two-digit factors.
@inproceedings{diva2:242180,
author = {Knutsson, Hans and Andersson, Mats and Wiklund, Johan},
title = {{Multiple Space Filter Design}},
booktitle = {Proceedings of the SSAB symposium on image analysis},
year = {1999},
}
In this paper we present a system which integrates computer vision and decision-making in an autonomous airborne vehicle that performs traffic surveillance tasks. The main factors that make the integration of vision and decision-making a challenging problem are: the qualitatively different kinds of information at the decision-making and vision levels, the need for integration of dynamically acquired information with a priori knowledge, e.g. GIS information, and the need for close feedback and guidance of the vision module by the decision-making module. Given the complex interaction between the vision module and the decision-making module, we propose the adoption of an intermediate structure, called the Scene Information Manager, and describe its structure and functionalities.
@inproceedings{diva2:241594,
author = {Coradeschi, Silvia and Karlsson, Lars and Nordberg, Klas},
title = {{Integration of vision and decision-making in an autonomous airborne vehicle for traffic surveillance}},
booktitle = {Proceedings of the International Conference on Vision Systems '99},
year = {1999},
}
There is no indication that it will ever be possible to find some simple trick that miraculously solves most problems in vision. It turns out that the processing system must be able to implement a model structure, the complexity of which is directly related to the structural complexity of the problem under consideration in the external world. It has become increasingly apparent that Vision cannot be treated in isolation from the response generation, because a very high degree of integration is required between different levels of percepts and corresponding response primitives. The response to be produced at a given instance is as much dependent upon the state of the system, as the percepts impinging upon the system. In addition, it has become apparent that many classical aspects of perception, such as geometry, probably do not belong to the percept domain of a Vision system, but to the response domain. This article will focus on what are considered crucial problems in Vision for robotics for the future, rather than on the classical solutions today. It will discuss hierarchical architectures for combination of percept and response primitives. It will discuss the concept of combined percept–response invariances as important structural elements for Vision. It will be maintained that learning is essential to obtain the necessary flexibility and adaptivity. In consequence, it will be argued that invariances for the purpose of Vision are not abstractly geometrical, but derived from the percept–response interaction with the environment. The issue of information representation becomes extremely important in distributed structures of the types foreseen, where uncertainty of information has to be stated for update of models and associated data. The question of object representation is central to the paper. Equivalence is established between the representations of response, geometry and time. Finally an integrated percept–response structure is proposed for flexible response control.
@inproceedings{diva2:241582,
author = {Granlund, Gösta},
title = {{Does Vision Inevitably Have to be Active?}},
booktitle = {Proceedings of the 11th Scandinavian Conference on Image Analysis},
year = {1999},
}
This paper presents a general approach for obtaining optimal filters as well as filter sequences. A filter is termed optimal when it minimizes a chosen distance measure with respect to an ideal filter. The method allows specification of the metric via simultaneous weighting functions in multiple domains, e.g. the spatio-temporal space and the Fourier space. Metric classes suitable for optimization of localized filters for multidimensional signal processing are suggested and discussed.
It is shown how convolution kernels for efficient spatio-temporal filtering can be implemented in practical situations. The method is based on applying a set of jointly optimized filter kernels in sequence. The optimization of sequential filters is performed using a novel recursive optimization technique. A number of optimization examples are given that demonstrate the role of key parameters such as: number of kernel coefficients, number of filters in sequence, spatio-temporal and Fourier space metrics.
The sequential filtering method enables filtering using only a small fraction of the number of filter coefficients required using conventional filtering. In multidimensional filtering applications the method potentially outperforms both standard convolution and FFT based approaches by two-digit factors.
@inproceedings{diva2:241578,
author = {Knutsson, Hans and Andersson, Mats and Wiklund, Johan},
title = {{Advanced Filter Design}},
booktitle = {Proceedings of the 11th Scandinavian Conference on Image Analysis},
year = {1999},
pages = {185--193},
publisher = {SCIA},
}
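The weighted least-squares core of such a design, a single kernel optimized against an ideal frequency response with simultaneous weights in the Fourier and spatial domains, can be sketched as below. The recursive optimization of sequences of sparse kernels, which is the paper's main contribution, is not shown, and the weighting functions and parameters are illustrative assumptions.

```python
import numpy as np

def design_filter(n_taps, ideal_response, n_freq=256,
                  fourier_weight=None, spatial_weight=None):
    """Weighted least-squares FIR design: minimize
    ||Wf (F b - r)||^2 + ||Ws b||^2 over the kernel b."""
    omega = np.linspace(-np.pi, np.pi, n_freq)
    taps = np.arange(n_taps) - (n_taps - 1) / 2
    F = np.exp(-1j * np.outer(omega, taps))                 # DTFT matrix
    r = ideal_response(omega)
    Wf = np.ones(n_freq) if fourier_weight is None else fourier_weight(omega)
    Ws = np.zeros(n_taps) if spatial_weight is None else spatial_weight(taps)
    A = (F.conj().T * Wf ** 2) @ F + np.diag(Ws ** 2)       # normal equations
    rhs = (F.conj().T * Wf ** 2) @ r
    return np.real(np.linalg.solve(A, rhs))                 # real for even ideal responses

# e.g. a low-pass kernel with extra weight on low frequencies
lowpass = design_filter(11, lambda w: (np.abs(w) < np.pi / 4).astype(float),
                        fourier_weight=lambda w: 1.0 / (np.abs(w) + 0.1))
```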
This paper introduces a novel algorithm for extracting the optical flow obtained from a translating camera in a static scene. Occlusion between objects is incorporated as a natural component in a scene reconstruction strategy by first evaluating and reconstructing the foreground and then excluding its influence on the partly occluded objects behind.
@inproceedings{diva2:273869,
author = {Ulvklo, Morgan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Adaptive Reconstruction Using Multiple Views}},
booktitle = {Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation},
year = {1998},
pages = {47--52},
}
This paper presents our general strategy for designing learning machines as well as a number of particular designs. The search for methods allowing a sufficient level of adaptivity is based on two main principles: 1. simple adaptive local models and 2. adaptive model distribution. Particularly important concepts in our work are mutual information and canonical correlation. Examples are given of learning feature descriptors, modeling disparity, synthesis of a global 3-mode model, and a setup for reinforcement learning of online video coder parameter control.
@inproceedings{diva2:273852,
author = {Knutsson, Hans and Borga, Magnus and Landelius, Tomas},
title = {{Learning Multidimensional Signal Processing}},
booktitle = {Proceedings of the 14th International Conference on Pattern Recognition, vol 2},
year = {1998},
pages = {1416--1420},
}
This paper introduces a signal processing strategy for depth segmentation and scene reconstruction that incorporates occlusion as a natural component. The work aims to maximize the use of connectivity in the temporal domain as much as possible under the condition that the scene is static and that the camera motion is known. An object behind the foreground is reconstructed using the fact that different parts of the object have been seen in different images in the sequence. One of the main ideas in this paper is the use of a spatio-temporal certainty volume c(x) with the same dimension as the input spatio-temporal volume s(x), and then using c(x) as a 'blackboard' for rejecting already segmented image structures. The segmentation starts with searching for image structures in the foreground, eliminating their occluding influence, and then proceeding. Normalized convolution, which is a Weighted Least Mean Square technique for filtering data with varying spatial reliability, is used for all filtering. High spatial resolution near object borders is achieved and only neighboring structures with similar depth support each other.
@inproceedings{diva2:273841,
author = {Ulvklo, Morgan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Depth Segmentation and Occluded Scene Reconstruction using Ego-motion}},
booktitle = {Proceedings of the SPIE Conference on Visual Information Processing},
year = {1998},
pages = {112--123},
}
WITAS will be engaged in goal-directed basic research in the area of intelligent autonomous vehicles and other autonomous systems. In this paper an overview of the project is given together with a presentation of our research interests in the project. The current status of our part in the project is also given.
@inproceedings{diva2:273831,
author = {Andersson, Thord and Granlund, Gösta H. and Farnebäck, Gunnar and Nordberg, Klas and Wiklund, Johan},
title = {{WITAS Project at Computer Vision Laboratory; A status report (Jan 1998)}},
booktitle = {Proceedings of the SSAB symposium on image analysis},
year = {1998},
pages = {113--116},
}
@inproceedings{diva2:246093,
author = {Lenz, Reiner and Granlund, Gösta},
title = {{If I had a fisheye I would not need SO(1,n) or, Is hyperbolic geometry useful in image processing?}},
booktitle = {Proceedings from the SSAB Symposium on Image Analysis},
year = {1998},
}
This paper presents a novel algorithm that uses CCA and phase analysis to detect the disparity in stereo images. The algorithm adapts filters in each local neighbourhood of the image in a way which maximizes the correlation between the filtered images. The adapted filters are then analyzed to find the disparity. This is done by a simple phase analysis of the scalar product of the filters. The algorithm can even handle cases where the images have different scales. The algorithm can also handle depth discontinuities and give multiple depth estimates for semi-transparent images.
@inproceedings{diva2:245992,
author = {Borga, Magnus and Knutsson, Hans},
title = {{An Adaptive Stereo Algorithm Based on Canonical Correlation Analysis}},
booktitle = {Proceedings of the Second IEEE International Conference on Intelligent Processing Systems},
year = {1998},
pages = {177--182},
}
This paper addresses the problem of motion-based segmentation of image sequences. One motion estimation algorithm and two segmentation algorithms are presented. The motion estimation is based on 3D orientation tensors and the algorithm can be used to estimate a large class of motion models, including the affine model that is used in the segmentation. The segmentation algorithms are based on a competitive region growing approach.
@inproceedings{diva2:273874,
author = {Farnebäck, Gunnar},
title = {{Motion-based Segmentation of Image Sequences using Orientation Tensors}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1997},
pages = {31--35},
}
@inproceedings{diva2:246082,
author = {Westin, Carl-Fredrik and Bhalerao, A. and Knutsson, Hans and Kikinis, Ron},
title = {{Using Local 3D Structure for Segmentation of Bone from Computer Tomography Images}},
booktitle = {Proceedings of IEEE CVPR 1997},
year = {1997},
}
It has become increasingly apparent that perception cannot be treated in isolation from the response generation, firstly because a very high degree of integration is required between different levels of percepts and corresponding response primitives. Secondly, it turns out that the response to be produced at a given instance is as much dependent upon the state of the system as upon the percepts impinging upon the system. The state of the system is in consequence the combination of the responses produced and the percepts associated with these responses. Thirdly, it has become apparent that many classical aspects of perception, such as geometry, probably do not belong to the percept domain of a Vision system, but to the response domain. There are not yet solutions available to all of these problems. In consequence, this overview will focus on what are considered crucial problems for the future, rather than on the solutions available today. It will discuss hierarchical architectures for combination of percept and response primitives, and the concept of combined percept-response invariances as important structural elements for Vision. It will be maintained that learning is essential to obtain the necessary flexibility and adaptivity. In consequence, it will be argued that invariances for the purpose of vision are not geometrical but derived from the percept-response interaction with the environment. The issue of information representation becomes extremely important in distributed structures of the types foreseen, where uncertainty of information has to be stated for update of models and associated data.
@inproceedings{diva2:245978,
author = {Granlund, Gösta H.},
title = {{From Multidimensional Signals to the Generation of Responses}},
booktitle = {Algebraic Frames for the Perception-Action Cycle, eds G. Sommer and J. J. Koenderink},
year = {1997},
series = {Lecture Notes in Computer Science},
volume = {1315},
pages = {29--53},
publisher = {Springer-Verlag},
}
@inproceedings{diva2:241593,
author = {Coradeschi, Silvia and Nordberg, Klas and Karlsson, Lars},
title = {{Integration of vision and reasoning in an airborne autonomous vehicle for traffic surveillance}},
booktitle = {Knowledge Based Computer Vision, Seminar-Report 196},
year = {1997},
}
@inproceedings{diva2:241573,
author = {Nordberg, Klas and Bergvall, Mathias and Granlund, Gösta H.},
title = {{Building Object Models from Range Data}},
booktitle = {Robotikdagar 97},
year = {1997},
}
@inproceedings{diva2:241561,
author = {Granlund, Gösta H.},
title = {{From signal to response: Issues in representation and computation}},
booktitle = {Proceedings of TFTS'97, The 2nd IEEE UK Symposium on Applications of Time-frequency and Time-scale Methods},
year = {1997},
}
@inproceedings{diva2:246109,
author = {Westin, Carl-Fredrik and Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta},
title = {{Attention Control for Robot Vision}},
booktitle = {CVPR},
year = {1996},
pages = {726--733},
publisher = {IEEE Computer Society Press},
}
@inproceedings{diva2:246060,
author = {Granlund, Gösta H.},
title = {{Operations and Representations for Multidimensional Information}},
booktitle = {Proceedings of RecPad'96, The 8th Portuguese Conference on Pattern Recognition},
year = {1996},
}
@inproceedings{diva2:246052,
author = {Granlund, Gösta H.},
title = {{Response Generation and Learning Crucial Issues in Machine Vision}},
booktitle = {Machine Perception Applications. Proc. of the IAPR TC-8 Workshop in Machine Perception Applications, Technical University, Graz, Austria, 2--3 September, 1996, eds A. Pinz and W. Pölzleitner},
year = {1996},
pages = {155--184},
}
@inproceedings{diva2:246023,
author = {Nordberg, Klas and Granlund, Gösta},
title = {{Equivariance and Invariance -- An Approach Based on Lie Groups}},
booktitle = {ICIP},
year = {1996},
}
A recursive method for separation of spherically separable quadrature filters into simple kernels with mainly one dimensional extent has been worked out. The resulting filter responses are mapped to a non-biased tensor representation where the local tensor constitutes a robust estimate of both the shape and the orientation (velocity) of the neighbourhood. The performance of this General Sequential Filter concept has exceeded the authors' most optimistic expectations. A qualitative evaluation results in no detectable loss in accuracy when compared to conventional FIR (Finite Impulse Response) filters, but the computation is performed 20-30 times faster. The magnitude of the attained speed-up implies that complex spatio-temporal analysis can be performed using standard hardware, such as a powerful workstation, in close to real time. Due to the soft implementation of the convolver and the tree structure of the sequential filtering approach, the processing is simple to optimize for most standard hardware. The method used in the examples was implemented in AVS (Application Visualization System) using modules written in C.
@inproceedings{diva2:273873,
author = {Knutsson, Hans and Andersson, Mats},
title = {{Optimization of Sequential Filters}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1995},
pages = {87--90},
}
@inproceedings{diva2:246075,
author = {Ulvklo, Morgan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Texture Gradient in Sparse Texture Fields}},
booktitle = {SCIA9},
year = {1995},
pages = {885--894},
}
A scheme for performing generalized convolutions is presented. A flexible convolver, which runs on standard workstations, has been implemented. It is designed for maximum throughput and flexibility. The implementation incorporates spatio-temporal convolutions with configurable vector combinations. It can handle general multi-linear operations, i.e. tensor operations on multidimensional data of any order. The input data and the kernel coefficients can be of arbitrary vector length. The convolver is configurable for IIR filters in the time dimension. Other features of the implemented convolver are scattered kernel data, region of interest and subsampling. The implementation is done as a C-library and a graphical user interface in AVS (Application Visualization System).
@inproceedings{diva2:246061,
author = {Wiklund, Johan and Knutsson, Hans},
title = {{A Generalized Convolver}},
booktitle = {SCIA9},
year = {1995},
}
@inproceedings{diva2:246038,
author = {Granlund, Gösta},
title = {{Biological vision: a source of challenges and ideas}},
booktitle = {DSAGM, Dansk Selskab for Genkendelse af Mønstre},
year = {1995},
}
@inproceedings{diva2:246030,
author = {Landelius, Tomas and Knutsson, Hans},
title = {{Behaviorism and Reinforcement Learning}},
booktitle = {Proceedings, 2nd Swedish Conference on Connectionism},
year = {1995},
pages = {259--270},
}
@inproceedings{diva2:242175,
author = {Karlholm, Jörgen and Westelius, Carl-Johan and Knutsson, Hans},
title = {{Object Tracking Based on the Orientation Tensor Concept}},
booktitle = {SCIA9, Uppsala},
year = {1995},
}
@inproceedings{diva2:515077,
author = {Nordberg, Klas and Granlund, Gösta and Knutsson, Hans},
title = {{Representation and learning of invariance}},
booktitle = {Proceedings of the IEEE International Conference on Image Processing (ICIP-94)},
year = {1994},
pages = {585--589},
}
@inproceedings{diva2:273855,
author = {Knutsson, Hans and Westin, Carl-Fredrik and Granlund, Gösta H.},
title = {{Local Multiscale Frequency and Bandwidth Estimation}},
booktitle = {ICIP},
year = {1994},
pages = {36--40},
}
In this paper it is shown how estimates of local structure and orientation can be obtained using a set of spherically separable quadrature filters. The method is applicable to signals of any dimensionality, the only requirement being that the filter set spans the corresponding orientation space. The estimates produced are second order tensors, the size of the tensors corresponding to the dimensionality of the input signal. A central part of the algorithm is an operation termed Tensor Whitening, reminiscent of classical whitening procedures. This operation compensates exactly for any biases introduced by non-uniform filter orientation distributions and/or non-uniform filter output certainties. Examples of processing of 2D-images, 3D-volumes and 2D-image sequences are given. Sensitivity to noise and missing filter outputs is analyzed in different situations. Estimation accuracy as a function of filter orientation distributions is studied. The studies provide evidence that the algorithm is robust and preferable to other algorithms in a wide range of situations.
@inproceedings{diva2:245986,
author = {Knutsson, Hans and Andersson, Magnus},
title = {{Robust N-Dimensional Orientation Estimation using Quadrature Filters and Tensor Whitening}},
booktitle = {ICASSP},
year = {1994},
}
This paper establishes an algebraic relation between two methods recently reported: normalized convolution and normalized differential convolution. These are general methods for filtering incomplete or uncertain data and are based on the separation of both data and operator into a signal part and a certainty part. General filtering can be performed without preprocessing input data with an interpolation step. The methods allow both data and operators to be scalars, vectors or tensors of higher order. Normalized differential convolution has been used in a wide range of applications. Examples are estimation of gradients in irregularly sampled data, estimation of differential invariants in sparse image flow fields, and image edge effect reduction. It was previously shown that normalized convolution produces a description of the neighbourhood which is optimal in a least square sense. The algebraic relation to normalized differential convolution presented in this paper proves that the latter method is optimal in the same sense as well.
@inproceedings{diva2:245982,
author = {Westin, Carl-Fredrik and Nordberg, Klas and Knutsson, Hans},
title = {{On the Equivalence of Normalized Convolution and Normalized Differential Convolution}},
booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994},
volume = {5},
year = {1994},
pages = {457--460},
}
@inproceedings{diva2:241575,
author = {Haglund, Leif and Fleet, David},
title = {{Stable Estimation of Image Orientation}},
booktitle = {Proceedings of the IEEE-ICIP},
year = {1994},
pages = {68--72},
}
@inproceedings{diva2:273865,
author = {Knutsson, Hans and Westin, Carl-Fredrik},
title = {{Normalized and Differential Convolution: Methods for Interpolation and Filtering of Incomplete and Uncertain Data}},
booktitle = {CVPR},
year = {1993},
pages = {515--523},
publisher = {IEEE},
}
In this paper learning is considered to be the bootstrapping procedure where fragmented past experience of what to do when performing well is used for the generation of new responses, adding more information to the system about the environment. The gained knowledge is represented by a behavior probability density function which is decomposed into a number of normal distributions using a binary tree. This tree structure is built by storing highly reinforced stimulus-response combinations, decisions, and calculating their mean decision vector and covariance matrix. Thereafter the decision space is divided, through the mean vector, into two halves along the direction of maximal data variation. The mean vector and the covariance matrix are stored in the tree node and the procedure is repeated recursively for each of the two halves of the decision space, forming a binary tree with mean vectors and covariance matrices in its nodes. The tree is the system's guide to response generation. Given a stimulus, the system searches for decisions likely to give a high reinforcement. This is accomplished by treating the sum of the normal distributions in the leaves, using their mean vectors and covariance matrices as the distribution parameters, as a distribution describing the system's behavior. A response is generated by fixating the stimulus in this sum of normal distributions and using the resulting distribution, which turns out to be a new sum of normal distributions, for random generation of the response. This procedure also makes it possible for the system to have several equally plausible responses to one stimulus when this is appropriate. Not applying maximum likelihood principles leads to a more explorative system behavior, avoiding local minima traps.
@inproceedings{diva2:273860,
author = {Landelius, Tomas and Knutsson, Hans},
title = {{The Learning Tree, A New Concept in Learning}},
booktitle = {Proceedings of the 2nd International Conference on Adaptive and Learning Systems},
year = {1993},
series = {SPIE},
volume = {1962},
}
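A sketch of the tree-building step described above: each node stores the mean and covariance of the decisions that reached it and splits them through the mean along the direction of maximal variance. The stopping criteria are assumptions of the sketch, and the response-generation step (conditioning the resulting mixture of Gaussians on a stimulus and sampling) is not shown.

```python
import numpy as np

def build_learning_tree(decisions, min_leaf=20, depth=0, max_depth=6):
    """Store mean and covariance of the decision vectors in each node and
    split the data through the mean along the direction of maximal variance."""
    node = {"mean": decisions.mean(axis=0),
            "cov": np.cov(decisions, rowvar=False)}
    if depth == max_depth or len(decisions) < 2 * min_leaf:
        return node
    _, eigvecs = np.linalg.eigh(node["cov"])
    direction = eigvecs[:, -1]                       # principal direction
    side = (decisions - node["mean"]) @ direction > 0
    if side.sum() < min_leaf or (~side).sum() < min_leaf:
        return node
    node["children"] = (
        build_learning_tree(decisions[~side], min_leaf, depth + 1, max_depth),
        build_learning_tree(decisions[side], min_leaf, depth + 1, max_depth),
    )
    return node
```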
@inproceedings{diva2:273835,
author = {Andersson, Mats and Knutsson, Hans},
title = {{Controllable 3-D Filters}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1993},
}
@inproceedings{diva2:273833,
author = {Bårman, Håkan and Granlund, Gösta H.},
title = {{Using Simple Local Fourier Domain Models for Computer-Aided Analysis of Mammograms}},
booktitle = {SCIA8},
year = {1993},
pages = {479--486},
publisher = {NOBIM, Norwegian Society for Image Processing and Pattern Recognition},
}
@inproceedings{diva2:273830,
author = {Landelius, Tomas and Haglund, Leif and Knutsson, Hans},
title = {{Depth and Velocity from Orientation Tensor Fields}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1993},
}
@inproceedings{diva2:246099,
author = {Granlund, Gösta H.},
title = {{Image Sequence Analysis}},
booktitle = {Mustererkennung 1993, Mustererkennung im Dienste der Gesundheit eds S.J. Pöppl and H. Handels},
year = {1993},
pages = {1--18},
}
The tensor representation has proven a successful tool as a means of describing local multi-dimensional orientation. In this respect, the tensor representation is a map from the local orientation to a second order tensor. This paper investigates how variations of the orientation are mapped to variations of the tensor, thereby giving an explicit equivariance relation. The results may be used in order to design tensor based algorithms for extraction of image features defined in terms of local variations of the orientation, e.g. multi-dimensional curvature or circular symmetries. It is assumed that the variation of the local orientation can be described in terms of an orthogonal transformation group. Under this assumption a corresponding orthogonal transformation group, acting on the tensor, is constructed. Several correspondences between the two groups are demonstrated.
@inproceedings{diva2:246080,
author = {Nordberg, Klas and Knutsson, Hans and Granlund, Gösta},
title = {{On the Equivariance of the Orientation and the Tensor Field Representation}},
booktitle = {SCIA8},
year = {1993},
pages = {57--63},
publisher = {NOBIM, Norwegian Society for Image Processing and Pattern Recognition},
}
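A minimal numerical illustration of this kind of equivariance relation, under the simplifying assumption that the tensor map is the rank-one outer product T(x) = x x^T (a sketch, not the general construction of the paper): a rotation acting on the orientation corresponds to conjugation acting on the tensor.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

T = np.outer(x, x)
lhs = np.outer(R @ x, R @ x)   # tensor of the rotated orientation
rhs = R @ T @ R.T              # rotated tensor
assert np.allclose(lhs, rhs)   # the two group actions correspond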
@inproceedings{diva2:246069,
author = {Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Hierarchical Gaze Control Using a Multi-resolution Image Sensor}},
booktitle = {Proceedings from Robotics Workshop},
year = {1993},
}
A hierarchical representation of the input-output transition function in a learning system is suggested. The choice between representing the knowledge in a learning system as a discrete set of input-output pairs or as a continuous input-output transition function is discussed. The conclusion is that both representations can be efficient, but at different levels of abstraction. The difference between strategies and actions is defined. An algorithm for using adaptive critic methods in a two-level reinforcement learning system is presented. Simulations of a one-dimensional hierarchical reinforcement learning system are presented.
@inproceedings{diva2:246059,
author = {Borga, Magnus},
title = {{Hierarchical Reinforcement Learning}},
booktitle = {ICANN'93 eds S. Gielen and B. Kappen},
year = {1993},
}
@inproceedings{diva2:246034,
author = {Knutsson, Hans and Westin, Carl-Fredrik and Westelius, Carl-Johan},
title = {{Filtering of Uncertain Irregularly Sampled Multidimensional Data}},
booktitle = {Twenty-seventh Asilomar Conf. on Signals, Systems \& Computers},
year = {1993},
pages = {1301--1309},
}
This paper presents a scale and orientation adaptive filtering strategy for images. The size, shape and orientation of the filter are signal controlled and thus locally adapted to each neighbourhood according to an estimated model. On each scale the filter is constructed as a linear weighting of fixed oriented bandpass filters having the same shape but different orientations. The resulting filter is interpolated from all scale levels and spans more than 6 octaves. It is possible to reconstruct an enhanced original image from the filtered images. The performance of the reconstruction algorithm displays two desirable but normally contradictory features, namely edge enhancement and an improvement of the signal-to-noise ratio. The adaptive filtering method has been tested on both real data and synthesized test data. The results are very good on a wide variety of images, from moderate down to low signal-to-noise ratios, even below 0 dB.
@inproceedings{diva2:246019,
author = {Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{Scale and Orientation Adaptive Filtering}},
booktitle = {SCIA8},
year = {1993},
}
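A rough illustration of the filter-synthesis idea follows (a sketch under assumed names; the actual quadrature filters, weighting scheme and scale interpolation of the paper are not reproduced): the adaptive output is formed per pixel as a weighted sum of fixed oriented bandpass filter responses, with weights steered by a locally estimated orientation.

import numpy as np

def orientation_weights(theta_map, K):
    """Weights for K filters oriented at angles k*pi/K, peaked at the locally
    estimated orientation theta_map (H, W). The cos^2 interpolation is an
    assumed choice, not taken from the paper."""
    angles = np.arange(K) * np.pi / K
    w = np.cos(theta_map[None] - angles[:, None, None]) ** 2
    return w / w.sum(axis=0, keepdims=True)

def adaptive_combination(oriented_responses, weights):
    """oriented_responses: (K, H, W) outputs of K fixed oriented bandpass filters.
    weights:               (K, H, W) signal-controlled weights per pixel.
    Returns the locally adapted filter output (H, W)."""
    return np.sum(oriented_responses * weights, axis=0)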
@inproceedings{diva2:245990,
author = {Knutsson, Hans and Westin, Carl-Fredrik},
title = {{Robust Estimation from Sparse Feature Fields}},
booktitle = {Proceedings of EC--US Workshop},
year = {1993},
}
@inproceedings{diva2:242235,
author = {Bårman, Håkan and Granlund, Gösta H.},
title = {{Hierarchical Feature Extraction for Computer- Aided Analysis of Mammograms}},
booktitle = {BIOMEDICAL IMAGE PROCESSING IV AND BIOMEDICAL VISUALIZATION},
year = {1993},
}
@inproceedings{diva2:241569,
author = {Bårman, Håkan and Granlund, Gösta H.},
title = {{Computer-Aided Analysis of Mammograms}},
booktitle = {Proceedings Nordic symposium on PACS, Digital Radiology and Telemedicine},
year = {1993},
pages = {76--},
}
@inproceedings{diva2:241562,
author = {Granlund, Gösta H.},
title = {{Issues in Robot Vision}},
booktitle = {British Machine Vision Conference 1993},
year = {1993},
pages = {1--14},
}
@inproceedings{diva2:274007,
author = {Wilson, Roland and Knutsson, Hans},
title = {{Seeing Things -- Disagreements on the necessary properties of a system that `Recognizes'}},
booktitle = {Workshop on Vision},
year = {1992},
pages = {177--189},
}
@inproceedings{diva2:274004,
author = {Knutsson, Hans},
title = {{The meaninglessness of `Sit-and-stare' -- How Vision-Action-Understanding is inseparable}},
booktitle = {Workshop on Vision},
year = {1992},
pages = {9--20},
}
@inproceedings{diva2:273867,
author = {Nordberg, Klas and Knutsson, Hans},
title = {{Some New Ideas in Signal Representation}},
booktitle = {Proceedings of ECCV--92},
year = {1992},
series = {Lecture Notes in Computer Science},
volume = {588},
publisher = {Springer--Verlag},
}
@inproceedings{diva2:273853,
author = {Wiklund, Johan and Westelius, Carl-Johan and Knutsson, Hans},
title = {{Hierarchical Phase Based Disparity Estimation}},
booktitle = {Proceedings of 2nd Singapore International Conference on Image Processing},
year = {1992},
publisher = {World Scientific Publishing},
address = {Singapore, River Edge, NJ},
}
@inproceedings{diva2:273848,
author = {Calway, Andrew and Knutsson, Hans and Wilson, Roland},
title = {{Multiresolution Frequency Domain Algorithm for Fast Image Registration}},
booktitle = {Proc. 3rd Int. Conf. on Visual Search},
year = {1992},
}
@inproceedings{diva2:273845,
author = {Knutsson, Hans and Haglund, Leif and Bårman, Håkan and Granlund, Gösta H.},
title = {{A Framework for Anisotropic Adaptive Filtering and Analysis of Image Sequences and Volumes}},
booktitle = {Proceedings ICASSP-92},
year = {1992},
publisher = {IEEE},
}
@inproceedings{diva2:273832,
author = {Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Preattentive Gaze Control for Robot Vision}},
booktitle = {Proceedings of Third International Conference on Visual Search},
year = {1992},
publisher = {Taylor and Francis},
}
@inproceedings{diva2:246090,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Extraction of Local Symmetries Using Tensor Field Filtering}},
booktitle = {Proceedings of 2nd Singapore International Conference on Image Processing},
year = {1992},
pages = {371--375},
}
@inproceedings{diva2:246066,
author = {Knutsson, Hans and Bårman, Håkan and Haglund, Leif},
title = {{Robust Orientation Estimation in 2D, 3D and 4D Using Tensors}},
booktitle = {Proceedings of Second International Conference on Automation, Robotics and Computer Vision, ICARCV'92},
year = {1992},
}
@inproceedings{diva2:245991,
author = {Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{On Scale and Orientation Adaptive Filtering}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1992},
}
@inproceedings{diva2:245974,
author = {Calway, Andrew and Knutsson, Hans and Wilson, Roland},
title = {{Multiresolution Estimation of 2-d Disparity Using a Frequency Domain Approach}},
booktitle = {Proc. British Machine Vision Conf.},
year = {1992},
}
@inproceedings{diva2:245968,
author = {Nordberg, Klas and Knutsson, Hans and Granlund, Gösta},
title = {{Signal Representation using Operators}},
booktitle = {Proceedings of EUSIPCO--92},
year = {1992},
}
@inproceedings{diva2:241584,
author = {Knutsson, Hans and Haglund, Leif and Granlund, Gösta},
title = {{Adaptive Filtering of Image Sequences and Volumes}},
booktitle = {Proceedings of International Conference on Automation, Robotics and Computer Vision},
year = {1992},
}
@inproceedings{diva2:273876,
author = {Calway, Andrew and Wilson, Roland},
title = {{The Multiresolution Fourier Transform and its Application to Image Analysis}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1991},
}
@inproceedings{diva2:273858,
author = {Haglund, Leif and Bårman, Håkan and Knutsson, Hans},
title = {{Estimation of Velocity and Acceleration in Time Sequences}},
booktitle = {Proceedings of the 7th Scandinavian Conference on Image Analysis},
year = {1991},
pages = {1033--1041},
}
@inproceedings{diva2:273844,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{The Möbius Strip Parameterization for Line Segmentation}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1991},
}
@inproceedings{diva2:273843,
author = {Nordberg, Klas and Knutsson, Hans},
title = {{Some new ideas in Signal Representation}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1991},
}
@inproceedings{diva2:246120,
author = {Bårman, Håkan and Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{Estimation of Velocity, Acceleration and Disparity in Time Sequences}},
booktitle = {Proceedings of IEEE Workshop on Visual Motion},
year = {1991},
pages = {44--51},
publisher = {IEEE Computer Society Press},
}
@inproceedings{diva2:246085,
author = {Bigun, Josef and Granlund, Gösta H. and Wiklund, Johan},
title = {{Multidimensional orientation:
texture analysis and optical flow}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1991},
pages = {110--113},
}
@inproceedings{diva2:245988,
author = {Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Focus of attention control}},
booktitle = {Proceedings of the 7th Scandinavian Conference on Image Analysis},
year = {1991},
pages = {667--674},
}
@inproceedings{diva2:245972,
author = {Knutsson, Hans and Haglund, Leif and Bårman, Håkan},
title = {{A Tensor Based Approach to Structure Analysis and Enhancement in 2D, 3D and 4D}},
booktitle = {Workshop Program, Seventh Workshop on Multidimensional Signal Processing},
year = {1991},
publisher = {IEEE Signal Processing Society},
}
@inproceedings{diva2:242250,
author = {Bårman, Håkan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Using Principal Direction Estimates for Shape and Acceleration Description}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1991},
}
@inproceedings{diva2:273866,
author = {Bårman, Håkan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Tensor Field Filtering and Curvature Estimation}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1990},
pages = {175--178},
}
@inproceedings{diva2:246058,
author = {Granlund, Gösta H.},
title = {{Processing and Analysis of Multidimensional Information Using Adaptive Models}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1990},
pages = {19--34},
}
@inproceedings{diva2:245987,
author = {Granlund, Gösta H. and Knutsson, Hans},
title = {{Compact Associative Representation of Visual Information}},
booktitle = {Proceedings of The 10th International Conference on Pattern Recognition},
year = {1990},
}
@inproceedings{diva2:241580,
author = {Knutsson, Hans and Granlund, Gösta H. and Bårman, Håkan},
title = {{A Note on Estimation of 4D Orientation}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1990},
pages = {192--195},
}
@inproceedings{diva2:241574,
author = {Knutsson, Hans and Haglund, Leif and Granlund, Gösta H.},
title = {{A New Approach to Image Enhancement Using Tensor Fields}},
booktitle = {Proceedings of the PROART Workshop on Vision},
year = {1990},
pages = {111--115},
}
@inproceedings{diva2:241565,
author = {Knutsson, Hans and Haglund, Leif and Granlund, Gösta H.},
title = {{Tensor Field Controlled Image Sequence Enhancement}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1990},
pages = {163--167},
}
@inproceedings{diva2:241560,
author = {Westelius, Carl-Johan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Model Projection in a Feature Hierarchy}},
booktitle = {Proceedings of the SSAB Symposium on Image Analysis},
year = {1990},
pages = {244--247},
}
The fundamental problem of finding a suitable representation of the orientation of 3D surfaces is considered. A representation is regarded suitable if it meets three basic requirements: Uniqueness, Uniformity and Polar separability. A suitable tensor representation is given.
At the heart of the problem lies the fact that orientation can only be defined modulo 180°, i.e. the fact that a 180° rotation of a line or a plane amounts to no change at all. For this reason, representing a plane using its normal vector leads to ambiguity, and such a representation is consequently not suitable. The ambiguity can be eliminated by establishing a mapping between R3 and a higher-dimensional tensor space.
The uniqueness requirement implies a mapping that maps all pairs of 3D vectors x and -x onto the same tensor T. Uniformity implies that the mapping implicitly carries a definition of distance between 3D planes (and lines) that is rotation invariant and monotone with the angle between the planes. Polar separability means that the norm of the representing tensor T is rotation invariant. One way to describe the mapping is that it maps a 3D sphere into 6D in such a way that the surface is uniformly stretched and all pairs of antipodal points map onto the same tensor.
It is demonstrated that the above mapping can be realized by sampling the 3D space using a specified class of symmetrically distributed quadrature filters. It is shown that 6 quadrature filters are necessary to realize the desired mapping, the orientations of the filters given by lines through the vertices of an icosahedron. The desired tensor representation can be obtained by simply performing a weighted summation of the quadrature filter outputs. This situation is indeed satisfying as it implies a simple implementation of the theory and that requirements on computational capacity can be kept within reasonable limits.
Noisy neighborhoods and/or linear combinations of tensors produced by the mapping will in general result in a tensor that has no direct counterpart in R3. In an adaptive hierarchical signal processing system, where information is flowing both up (increasing the level of abstraction) and down (for adaptivity and guidance), it is necessary that a meaningful inverse exists for each level-altering operation. It is shown that the point in R3 that corresponds to the best approximation of a given tensor is given by the largest eigenvalue times the corresponding eigenvector of the tensor.
@inproceedings{diva2:274015,
author = {Knutsson, Hans},
title = {{Representing Local Structure Using Tensors}},
booktitle = {Proceedings of the 6th Scandinavian Conference on Image Analysis},
year = {1989},
series = {LiTH-ISY-I},
volume = {1019},
pages = {244--251},
publisher = {Linköping University Electronic Press},
address = {Linköping},
}
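The stated inverse mapping, that the 3D orientation best approximating a (possibly noisy) tensor is the largest eigenvalue times the corresponding eigenvector, can be sketched in a few lines of numpy (illustrative only, not the author's code):

import numpy as np

def best_orientation(T):
    """T: symmetric 3x3 orientation tensor. Returns lambda_max * e_max."""
    vals, vecs = np.linalg.eigh(T)       # eigenvalues in ascending order
    return vals[-1] * vecs[:, -1]

# Example: a rank-one tensor built from x (x and -x map onto the same tensor)
x = np.array([1.0, 2.0, -0.5])
T = np.outer(x, x) / np.linalg.norm(x)
v = best_orientation(T)                  # parallel to x, up to sign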
A methodology for spatio-temporal filtering of image sequences is under development at the Computer Vision Laboratory, Linköping University. In recent years, scale analysis has been found to be a necessary tool in the analysis of stationary images. It is our belief that a combination of spatio-temporal filtering and scale analysis is required to obtain satisfactory results on image sequences. A growing need and the availability of more powerful computers are the most important reasons for this development. The objectives and proposed methods are discussed in relation to known properties of mammalian vision.
@inproceedings{diva2:275342,
author = {Wiklund, Johan and Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{Time Sequence Analysis Using Multi-Resolution Spatio-Temporal Filters}},
booktitle = {Time-Varying Image Processing and Moving Object Recognition, 2},
year = {1989},
pages = {258--265},
publisher = {Elsevier Science Publishers},
address = {Amsterdam},
}
@inproceedings{diva2:274025,
author = {Bårman, Håkan and Knutsson, Hans and Granlund, Gösta H.},
title = {{A Filtering Strategy for Orientation and Curvature Description}},
booktitle = {The 6th Scandinavian Conference on Image Analysis},
year = {1989},
pages = {886--889},
}
We have in the preceding sections studied the use of magnitude representation for feature variables. There are several indications that such a representation may be used in biological visual systems.
The natural introduction of a nonlinearity may be most useful for many purposes. This has been studied for the implementation of penalty function operations. Such operations show great promise as they can be made very specific based on their zero-crossing property.
There is a great deal of indication that inhibition or penalty mechanisms are very important in neural systems. It has e.g. been found that in the cerebellar structure almost all synapses are inhibitory. This could indicate that inhibitory or penalty matching is a primary mechanism in biological vision systems.
@inproceedings{diva2:274022,
author = {Granlund, Gösta H.},
title = {{Magnitude Representation of Features in Image Analysis}},
booktitle = {Proceedings of the 6th Scandinavian Conference on Image Analysis : Oulu, June 19-22, 1989},
year = {1989},
pages = {212--219},
publisher = {Pattern Recognition Society of Finland},
}
@inproceedings{diva2:274017,
author = {Westelius, Carl-Johan and Westin, Carl-Fredrik},
title = {{A Colour Representation for Scale-spaces}},
booktitle = {The 6th Scandinavian Conference on Image Analysis},
year = {1989},
pages = {890--893},
}
@inproceedings{diva2:274012,
author = {Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{On Phase Representation of Image Information}},
booktitle = {The 6th Scandinavian Conference on Image Analysis},
year = {1989},
pages = {1082--1089},
}
@inproceedings{diva2:274010,
author = {Andersson, Mats and Knutsson, Hans and Granlund, Gösta H.},
title = {{Implementation of Image Processing Operations from Analogue Convolver Responses}},
booktitle = {Proceedings of the SSAB Conference on Image Analysis},
year = {1989},
pages = {67--74},
}
@inproceedings{diva2:274009,
author = {Haglund, Leif and Knutsson, Hans and Granlund, Gösta H.},
title = {{Scale Analysis Using Phase Representation}},
booktitle = {The 6th Scandinavian Conference on Image Analysis},
year = {1989},
pages = {1118--1125},
}
@inproceedings{diva2:274005,
author = {Westelius, Carl-Johan and Westin, Carl-Fredrik},
title = {{Representation of colour in image processing}},
booktitle = {Proceedings of the SSAB Conference on Image Analysis},
year = {1989},
}
@inproceedings{diva2:273872,
author = {Knutsson, Hans and Granlund, Gösta H.},
title = {{Spatio-Temporal Analysis Using Tensors}},
booktitle = {Sixth Multidimensional Signal Processing Workshop},
year = {1989},
}
@inproceedings{diva2:246064,
author = {Bårman, Håkan and Granlund, Gösta H. and Knutsson, Hans},
title = {{A new approach to curvature estimation and description}},
booktitle = {3rd International Conference on Image Processing and its Applications},
year = {1989},
pages = {54--58},
}
@inproceedings{diva2:245965,
author = {Wilson, Roland and Knutsson, Hans},
title = {{A Multiresolution Stereopsis Algorithm Based on the Gabor Representation}},
booktitle = {3rd International Conference on Image Processing and Its Applications},
year = {1989},
pages = {19--22},
}
@inproceedings{diva2:241598,
author = {Granlund, Gösta H.},
title = {{Processing and Analysis of Multidimensional Information Using Adaptive Models}},
booktitle = {Proceedings of the SSAB Conference on Image Analysis},
year = {1989},
pages = {37--44},
}
@inproceedings{diva2:273862,
author = {Granlund, Gösta H.},
title = {{Processing and Analysis of Multi-Dimensional Information Using Adaptive Models}},
booktitle = {Proceedings from SSAB Symposium on Picture Processing},
year = {1988},
}
@inproceedings{diva2:273842,
author = {Bigun, Josef and Granlund, Gösta H.},
title = {{Optical Flow Based on the Inertia Matrix of the Frequency Domain}},
booktitle = {Proceedings from SSAB Symposium on Picture Processing},
year = {1988},
pages = {132--135},
}
A method for modeling symmetries of the neighborhoods in gray-value images is derived. It is based on the form of the iso-gray-value curves. For every neighborhood a complex number is obtained through a convolution of a complex-valued image with a complex-valued filter. The magnitude of the complex number is the degree of symmetry with respect to the a priori chosen harmonic function pair. The degree of symmetry has a clear definition which is based on the error in the Fourier domain. The argument of the complex number is the angle representing the relative dominance of one of the pair of harmonic functions compared to the other.
@inproceedings{diva2:273828,
author = {Bigun, Josef},
title = {{Recognition of Local Symmetries in Gray Value Images by Harmonic Functions}},
booktitle = {Proceedings of the 9th International Conference on Pattern Recognition, Vol. 1},
year = {1988},
pages = {345--347},
}
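As a loose illustration of the mechanism described above (a sketch under assumptions: the choice of a gradient-based complex "derivative image" and a generic complex kernel are mine, not the paper's specific filters), one complex number per neighborhood is obtained by a complex-valued convolution; its magnitude is read as the degree of symmetry and its argument as the dominance angle.

import numpy as np
from scipy.signal import convolve2d

def symmetry_response(image, complex_kernel):
    """Returns (degree-of-symmetry map, angle map) for a gray-value image."""
    gy, gx = np.gradient(image.astype(float))
    z = gx + 1j * gy                                  # complex-valued derivative image
    resp = convolve2d(z, complex_kernel, mode="same") # complex kernel, complex result
    return np.abs(resp), np.angle(resp)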
@inproceedings{diva2:246118,
author = {Bårman, Håkan and Granlund, Gösta H.},
title = {{Corner Detection Using Local Symmetry}},
booktitle = {Proceedings from SSAB Symposium on Picture Processing},
year = {1988},
}
The symmetries in a neighbourhood of a gray value image are modelled by conjugate harmonic function pairs. These are shown to be a suitable curvilinear coordinate pair, in which the model represents a neighbourhood. In this representation the image parts, which are symmetric with respect to the chosen function pair, have iso-gray value curves which are simple lines or parallel line patterns. The detection is modelled in the special Fourier domain corresponding to the new variables by minimizing an error function. It is shown that the minimization process or detection of these patterns can be carried out for the whole image entirely in the spatial domain by convolutions. What will be defined as the partial derivative image is the image which takes part in the convolution. The convolution kernel is complex valued, as are the partial derivative image and the result. The magnitudes of the result are shown to correspond to a well-defined certainty measure, while the orientation is the least-squares estimate of an orientation in the Fourier transform corresponding to the harmonic coordinates. Applications to four symmetries are given. These are circular, linear, hyperbolic and parabolic symmetries. Experimental results are presented.
@inproceedings{diva2:246070,
author = {Bigun, Josef},
title = {{Pattern Recognition by detection of local symmetries}},
booktitle = {Pattern Recognition and Artificial Intelligence},
year = {1988},
pages = {75--90},
}
The problem of optimal detection of orientation in arbitrary neighborhoods is solved in the least squares sense. It is shown that this corresponds to fitting an axis in the Fourier domain of the n-dimensional neighborhood, the solution of which is given by a well-known matrix eigenvalue problem. The eigenvalues are the variance or inertia with respect to the axes given by their respective eigenvectors. The orientation is taken as the axis given by the least eigenvalue. Moreover, it is shown that the necessary computations can be pursued in the spatial domain without doing a Fourier transformation. An implementation for 2-D is presented. Two certainty measures are given corresponding to the orientation estimate. These are the relative or the absolute distances between the two eigenvalues, revealing whether the fitted axis is much better than an axis orthogonal to it. The result of the implementation is verified by experiments which confirm accurate orientation estimation and a reliable certainty measure in the presence of additive noise at high as well as low levels.
@inproceedings{diva2:274026,
author = {Bigun, Josef and Granlund, Gösta H.},
title = {{Optimal Orientation Detection of Linear Symmetry}},
booktitle = {Proceedings of the IEEE First International Conference on Computer Vision},
year = {1987},
pages = {433--438},
}
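A minimal 2-D sketch of the idea (an assumption-laden illustration, not the published implementation): build a local inertia matrix from image gradients entirely in the spatial domain, read the orientation from its eigenstructure, and use the relative eigenvalue distance as a certainty measure.

import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_and_certainty(image, sigma=2.0):
    gy, gx = np.gradient(image.astype(float))
    # local averages of outer products of the gradient (the inertia matrix per pixel)
    Jxx = gaussian_filter(gx * gx, sigma)
    Jxy = gaussian_filter(gx * gy, sigma)
    Jyy = gaussian_filter(gy * gy, sigma)
    tr = Jxx + Jyy
    diff = np.sqrt((Jxx - Jyy) ** 2 + 4 * Jxy ** 2)
    lam_max, lam_min = (tr + diff) / 2, (tr - diff) / 2
    # axis of the largest eigenvalue, i.e. dominant gradient direction;
    # the local line/edge orientation is orthogonal to it
    theta = 0.5 * np.arctan2(2 * Jxy, Jxx - Jyy)
    certainty = (lam_max - lam_min) / (tr + 1e-12)   # relative eigenvalue distance
    return theta, certainty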
@inproceedings{diva2:274013,
author = {Bigun, Josef},
title = {{Some Mathematical Tools of Computers for Vision Purposes}},
booktitle = {Proceedings of the 7th Nordic Conference on Teaching of Mathematics at Technical Universities},
year = {1987},
}
@inproceedings{diva2:273854,
author = {Wiklund, Johan and Granlund, Gösta H.},
title = {{Image Sequence Analysis for Object Tracking.}},
booktitle = {Proc. of The 5th Scandinavian Conference on Image Analysis},
year = {1987},
pages = {641--648},
}
@inproceedings{diva2:242185,
author = {Granlund, Gösta H.},
title = {{Imprecision of Measurements in Computer Vision Handled by Fuzzy Set Theory}},
booktitle = {5th IEEE-ASSP and EURASIP Workshop on Multidimensional Signal Processing},
year = {1987},
}
@inproceedings{diva2:241576,
author = {Knutsson, Hans},
title = {{A Tensor Representation of 3-D Structures}},
booktitle = {5th IEEE-ASSP and EURASIP Workshop on Multidimensional Signal Processing},
year = {1987},
}
@inproceedings{diva2:274020,
author = {Wiklund, Johan and Granlund, Gösta H.},
title = {{Tracking of Multiple Moving Objects}},
booktitle = {Proceedings of the Second International Workshop on Time-Varying Image Processing and Moving Object Recognition},
year = {1986},
pages = {241--250},
publisher = {Elsevier Science Publishers B.V.},
address = {Amsterdam},
}
A definition of central symmetry for local neighborhoods of 2-D images is given. A complete ON-set of centrally symmetric basis functions is proposed. The local neighborhoods are expanded in this basis. The behavior of the coefficient spectrum obtained by this expansion is proposed as the foundation of central symmetry parameters of the neighborhoods. Specifically, examination of two such behaviors is proposed: point concentration and line concentration of the energy spectrum. Moreover, the study of these types of behaviors of the spectrum is shown to be possible in the spatial domain.
@inproceedings{diva2:246021,
author = {Bigun, Josef and Granlund, Gösta H.},
title = {{Central Symmetry Modelling}},
booktitle = {Proceedings of EUSIPCO-86, Third European Signal Processing Conference},
year = {1986},
pages = {883--886},
}
@inproceedings{diva2:241595,
author = {Knutsson, Hans},
title = {{Representing and Estimating 3-D Orientation Using Quadrature Filters}},
booktitle = {Conference Publication No. 265, Second Int. Conf. on Image Processing and Its Applications},
year = {1986},
pages = {87--91},
}
@inproceedings{diva2:274016,
author = {Knutsson, Hans},
title = {{Producing a Continuous and Distance Preserving 5-D Vector Representation of 3-D Orientation}},
booktitle = {IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management - CAPAIDM},
year = {1985},
pages = {175--182},
}
@inproceedings{diva2:246055,
author = {Granlund, Gösta H. and Arvidsson, Jan},
title = {{Computer Architectures for Image Processing.}},
booktitle = {Proceedings of The 4th Scandinavian Conference on Image Analysis},
year = {1985},
}
@inproceedings{diva2:273851,
author = {Knutsson, Hans and Granlund, Gösta H.},
title = {{Texture Analysis Using Two-Dimensional Quadrature Filters}},
booktitle = {IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management - CAPAIDM},
year = {1983},
}
@inproceedings{diva2:245962,
author = {Granlund, Gösta H.},
title = {{Hierarchical Image Processing}},
booktitle = {Proceedings of SPIE Technical Conference},
year = {1983},
}
@inproceedings{diva2:246102,
author = {Granlund, Gösta H. and Knutsson, Hans},
title = {{Hierarchical Processing of Structural Information in Artificial Intelligence}},
booktitle = {Proceedings of 1982 IEEE Conference on Acoustics, Speech and Signal Processing},
year = {1982},
}
@inproceedings{diva2:241591,
author = {Hedlund, Martin and Granlund, Gösta H. and Knutsson, Hans},
title = {{A Consistency Operation for Line and Curve Enhancement}},
booktitle = {The Computer Society Conference on PR\&IP},
year = {1982},
}
@inproceedings{diva2:241590,
author = {Granlund, Gösta H. and Arvidsson, Jan and Knutsson, Hans},
title = {{GOP, A Paradigm in Hierarchical Image Processing}},
booktitle = {Proceedings of The First IEEE Computer Society International Symposium on Medical Imaging and Image Interpretation, ISMI II'82},
year = {1982},
}
@inproceedings{diva2:241581,
author = {Wilson, Roland and Knutsson, Hans and Granlund, Gösta H.},
title = {{The Operational Definition of the Position of Line and Edge}},
booktitle = {The 6th International Conference on Pattern Recognition},
year = {1982},
}
@inproceedings{diva2:241570,
author = {Wilson, Roland and Knutsson, Hans and Granlund, Gösta H.},
title = {{Image Coding Using a Predictor Controlled by Image Content}},
booktitle = {Proceedings of 1982 IEEE Conference on Acoustics, Speech and Signal Processing},
year = {1982},
}
The related problems of enhancing and restoring noisy images have received a considerable amount of attention in recent years. Restoration methods have generally been based on minimum mean-squared error operations, such as Wiener filtering or recursive filtering. The rather vague title of enhancement has been given to a wide variety of more or less ad-hoc methods, such as median filtering, which have nonetheless been found useful. In most cases, however, the aim is the same: an improvement of the subjective quality of the image.
@inproceedings{diva2:273846,
author = {Knutsson, Hans and Wilson, Roland and Granlund, Gösta H.},
title = {{Anisotropic Filtering Controlled by Image Content}},
booktitle = {Proceedings of the 2nd Scandinavian Conference on Image Analysis},
year = {1981},
series = {IEEE Acoustics, Speech, and Signal Processing Newsletter},
volume = {Vol. 50, issue 1},
pages = {146--151},
}
@inproceedings{diva2:241596,
author = {Hedlund, Martin and Granlund, Gösta H. and Knutsson, Hans},
title = {{Image Filtering and Relaxation Procedures using Hierarchical Models}},
booktitle = {Proceedings of the 2nd Scandinavian Conference on Image Analysis},
year = {1981},
}
@inproceedings{diva2:241592,
author = {Knutsson, Hans and Wilson, Roland and Granlund, Gösta H.},
title = {{Anisotropic Filtering Operations for Image Enhancement and their Relation to the Visual System}},
booktitle = {Proceedings of IEEE Computer Society Conference on Pattern Recognition and Image Processing},
year = {1981},
}
@inproceedings{diva2:241585,
author = {Knutsson, Hans and Wilson, Roland and Granlund, Gösta H.},
title = {{Content-Dependent Anisotropic Filtering of Images}},
booktitle = {Proceedings of International Conference on Digital Signal Processing},
year = {1981},
}
@inproceedings{diva2:274006,
author = {Granlund, Gösta},
title = {{Description of texture using the general operator approach}},
booktitle = {5th International Conference on Pattern Recognition},
year = {1980},
pages = {776--779},
}
@inproceedings{diva2:246116,
author = {Knutsson, Hans and Post, B. von and Granlund, Gösta H.},
title = {{Optimization of Arithmetic Neighborhood Operations for Image Processing}},
booktitle = {Proceedings of the First Scandinavian Conference on Image Analysis},
year = {1980},
}
@inproceedings{diva2:242232,
author = {Knutsson, Hans and Granlund, Gösta H.},
title = {{Fourier Domain Design of Line and Edge Detectors}},
booktitle = {Proceedings of the 5th International Conference on Pattern Recognition},
year = {1980},
}
@inproceedings{diva2:246088,
author = {Jilken, L. and Bäcklund, J. and Knutsson, Hans},
title = {{Automatic Fatigue Threshold Value Testing}},
booktitle = {Conf. on Mechanisms of Deformation and Fracture},
year = {1978},
}
@inproceedings{diva2:246097,
author = {Granlund, Gösta H.},
title = {{Pattern Processing Using Multilevel Systems}},
booktitle = {Proceedings of the Eigth Annual Allerton Conference on Circuit and System Theory},
year = {1970},
pages = {445--453},
}
Conference proceedings
This volume constitutes the refereed proceedings of the 21st Scandinavian Conference on Image Analysis, SCIA 2019, held in Norrköping, Sweden, in June 2019.
The 40 revised papers presented were carefully reviewed and selected from 63 submissions. The contributions are structured in topical sections on Deep convolutional neural networks; Feature extraction and image analysis; Matching, tracking and geometry; and Medical and biomedical image analysis.
@proceedings{diva2:1387551,
title = {{Image Analysis}},
year = {2019},
editor = {Felsberg, Michael and Forss\'{e}n, Per-Erik and Sintorn, Ida-Maria and Unger, Jonas},
series = {Image Processing, Computer Vision, Pattern Recognition, and Graphics},
volume = {11482},
publisher = {Springer},
}
The two volume set LNCS 10424 and 10425 constitutes the refereed proceedings of the 17th International Conference on Computer Analysis of Images and Patterns, CAIP 2017, held in Ystad, Sweden, in August 2017.
The 72 papers presented were carefully reviewed and selected from 144 submissions. The papers are organized in the following topical sections: Vision for Robotics; Motion and Tracking; Segmentation; Image/Video Indexing and Retrieval; Shape Representation and Analysis; Biomedical Image Analysis; Biometrics; Machine Learning; Image Restoration; and Poster Sessions.
@proceedings{diva2:1366020,
title = {{Computer Analysis of Images and Patterns:
17th International Conference, CAIP 2017, Ystad, Sweden, August 22-24, 2017, Proceedings, Part I}},
year = {2017},
editor = {Felsberg, Michael and Heyden, Anders and Krüger, Norbert},
series = {Lecture Notes in Computer Science},
volume = {10424},
publisher = {Springer},
address = {Cham},
}
The two volume set LNCS 10424 and 10425 constitutes the refereed proceedings of the 17th International Conference on Computer Analysis of Images and Patterns, CAIP 2017, held in Ystad, Sweden, in August 2017. The 72 papers presented were carefully reviewed and selected from 144 submissions. The papers are organized in the following topical sections: Vision for Robotics; Motion and Tracking; Segmentation; Image/Video Indexing and Retrieval; Shape Representation and Analysis; Biomedical Image Analysis; Biometrics; Machine Learning; Image Restoration; and Poster Sessions.
@proceedings{diva2:1185567,
title = {{Computer Analysis of Images and Patterns:
17th International Conference, CAIP 2017, Ystad, Sweden, August 22-24, 2017, Proceedings, Part II}},
year = {2017},
editor = {Felsberg, Michael and Heyden, Anders and Krüger, Norbert},
series = {Lecture Notes in Computer Science},
volume = {10425},
publisher = {Springer},
}
On behalf of the Organizing Committee, it is my honor and privilege to present the scientific program of the 22nd International Conference on Pattern Recognition. ICPR 2014 is hosted by the Swedish Society for Automated Image Analysis (SSBA) and supported by the universities of Linköping, Lund and Uppsala.
ICPR 2014 has five scientific tracks: Computer Vision; Pattern Recognition and Machine Learning; Image, Speech, Signal and Video Processing; Document Analysis, Biometrics and Pattern Recognition Applications; and Biomedical Image Analysis. For each track there is an Invited Speaker who will share their deep knowledge and experience with us. Perhaps the most apparent novelty of this ICPR is the change from four-page to six-page papers, which is significantly more than a 50% increase in actual content, disregarding the title, abstract and reference list. Our hope and belief is that this has improved the possibility for the reviewers to make well-justified evaluations of the manuscripts, improved the readability of the final papers and, as a consequence, improved the general quality of the accepted papers.
The organization of ICPR 2014 would not have been possible without the generous contributions of our major partners, the City of Stockholm, SSBA, eSSENCE and SeRC. The financial contributions of our other partners and exhibitors, as well as the technical co-sponsorship of the IEEE Computer Society, are also gratefully acknowledged, as are the support and advice from IAPR and the ICPR Liaison Committee. I also want to express my sincere gratitude to the Program and Publication Chairs, the Track Chairs, Area Chairs and all reviewers for their great efforts in putting this scientific program together. And, perhaps most of all, I want to thank all the contributing authors who filled it with content of the highest scientific quality. Finally, I would like to express my gratitude to all attendees. Without your presence, there simply wouldn't be any conference.
@proceedings{diva2:850266,
title = {{Proceedings. 22nd International Conference on Pattern Recognition ICPR 2014, 24-28 August 2014, Stockholm, Sweden}},
year = {2014},
editor = {Heyden, Anders and Laurendeau, Denis and Felsberg, Michael and Borga, Magnus},
series = {Conference on Pattern Recognition (CPR)},
volume = {1-6},
publisher = {IEEE conference proceedings},
}
This book constitutes the refereed proceedings of the 33rd Symposium of the German Association for Pattern Recognition, DAGM 2011, held in Frankfurt/Main, Germany, in August/September 2011. The 20 revised full papers and 22 revised poster papers were carefully reviewed and selected from 98 submissions. The papers are organized in topical sections on object recognition, adverse vision conditions challenge, shape and matching, segmentation and early vision, robot vision, machine learning, and motion. The volume also includes the Young Researchers' Forum, a section where a carefully jury-selected ensemble of young researchers present their Master's thesis work.
@proceedings{diva2:850263,
title = {{Pattern Recognition:
33rd DAGM Symposium, Frankfurt/Main, Germany, August 31 - September 2, 2011, Proceedings}},
year = {2011},
editor = {Mester, Rudolf and Felsberg, Michael},
series = {Lecture Notes in Computer Science},
volume = {6835},
publisher = {Springer},
}
Theses
A mobile robot, instructed by a human operator, acts in an environment with many other objects. For an autonomous robot, however, human instructions should be minimal and limited to high-level directives, such as the ultimate task or destination. In order to increase the level of autonomy, it has become a foremost objective to mimic human vision using neural networks that take a stream of images as input and learn a specific computer vision task from large amounts of data. In this thesis, we explore several different models for surround sensing, each of which contributes to a better understanding of the environment.
As its first contribution, this thesis presents an object tracking method for video sequences, which is a crucial component in a perception system. This method predicts a fine-grained mask to separate the pixels corresponding to the target from those corresponding to the background. Rather than tracking location and size, the method tracks the pixels initially assigned to the target in this so-called video object segmentation. For subsequent time steps, the goal is to learn how the target looks using features from a neural network. We named our method A-GAME, based on its generative modeling of a deep feature space that separates target and background appearances.
In the second contribution of this thesis, we detect, track, and segment all objects from a set of predefined object classes. This information increases the robot's capability to perceive its surroundings. We experiment with a graph neural network to weigh all new detections and existing tracks. This model outperforms prior works by separating visually and semantically similar objects frame by frame.
The third contribution investigates one limitation of anchor-based detectors, which classify pre-defined bounding boxes as either negative or positive and thus provide a limited set of handled object shapes. One idea is to learn an alternative instance representation. We experiment with a neural network that predicts the distance to the nearest object contour in different directions from each pixel. The network then computes an approximated signed distance function containing the respective instance information.
Last, this thesis studies a concept within model validation. We observed that overfitting can increase performance on benchmarks. However, this is of little value for sensing systems in practice, since measurements, such as lengths or angles, are quantities that describe the environment. The fourth contribution of this thesis is an extended validation technique for camera calibration. This technique uses a statistical model for each error difference between an observed value and a corresponding prediction of the projective model. We compute a test over the differences and detect if the projective model is incorrect.
@phdthesis{diva2:1745714,
author = {Brissman, Emil},
title = {{Learning to Analyze Visual Data Streams for Environment Perception}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2283}},
year = {2023},
address = {Sweden},
}
As technology continues to advance, the interest in relieving humans of tedious or dangerous tasks through automation increases. Some of the tasks that have received increasing attention are autonomous driving, disaster relief, and forestry inspection. Developing and deploying an autonomous robotic system in this type of unconstrained environment, in a safe way, is highly challenging. The system requires precise control and high-level decision making, both of which require a robust and reliable perception system to understand the surroundings correctly.
The main purpose of perception is to extract meaningful information from the environment, be it in the form of 3D maps, dense classification of the types of objects and surfaces, or high-level information about the position and direction of moving objects. Depending on the limitations and application of the system, various types of sensors can be used: lidars, to collect sparse depth information; cameras, to collect dense information for different parts of the visual spectrum, often the red-green-blue (RGB) bands; Inertial Measurement Units (IMUs), to estimate the ego motion; microphones, to interact and respond to humans; GPS receivers, to get global position information; just to mention a few.
This thesis investigates some of the necessities to approach the requirements of this type of system, specifically focusing on data-driven approaches, that is, machine learning, which in recent years has been shown time and again to be the main competitor for high-performance perception tasks. Although precision requirements might be high in industrial production plants, the environment there is relatively controlled and the task is fixed. Instead, this thesis studies some of the aspects necessary for complex, unconstrained environments, primarily outdoors and potentially near humans or other systems. The term in the wild refers exactly to the unconstrained nature of these environments, where the system can easily encounter something previously unseen and where the system might interact with unknowing humans. Some examples of such environments are city traffic, disaster relief scenarios, and dense forests.
This thesis will mainly focus on the following three key aspects necessary to handle the types of tasks and situations that could occur in the wild: 1) generalizing to a new environment, 2) adapting to new tasks and requirements, and 3) modeling uncertainty in the perception system.
First, a robotic system should be able to generalize to new environments and still function reliably. Papers B and G address this by using an intermediate representation to allow the system to handle much more diverse types of environment than otherwise possible. Paper B also investigates how robust the proposed autonomous driving system was to incorrect predictions, which is one of the likely results of changing the environment.
Second, a robot should be sufficiently adaptive to allow it to learn new tasks without forgetting the previous ones. Paper E proposed a way to allow incrementally adding new semantic classes to a trained model without access to the previous training data. The approach is based on utilizing the uncertainty in the predictions to model the unknown classes, marked as background.
Finally, the perception system will always be partially flawed, either because of the lack of modeling capabilities or because of ambiguities in the sensor data. To properly take this into account, it is fundamental that the system has the ability to estimate the certainty in the predictions. Paper F proposed a method for predicting the uncertainty in the model predictions when interpolating sparse data. Paper G addresses the ambiguities that exist when estimating the 3D pose of a human from a single camera image.
@phdthesis{diva2:1740415,
author = {Holmquist, Karl},
title = {{Data-Driven Robot Perception in the Wild}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2293}},
year = {2023},
address = {Sweden},
}
Connected and autonomous vehicles (CAVs) are an emerging trend in the transport sector and their impact on transportation, the economy, society and the environment will be tremendous. Much like the automobile shaped the way humans travelled, lived and worked during the 20th century, CAVs have yet again the potential to affect and reform all of these areas. Besides the imminent technological challenges of making CAVs a market-ready reality, a plethora of ethical, social and legal questions will have to be addressed along the way. Knowledge of and interaction with the surrounding infrastructure and other actors in the system will be essential for CAVs in order to pave the way for progressive solutions to urgent sustainability and mobility issues in transportation.
Road networks, i.e. the networks of roads and intersections, are the core infrastructure on which CAVs will operate. Thus, having detailed knowledge about them is key for CAVs in order to take the right decisions on both short-term actions that will affect individual traffic users in immediate situations and long-term actions that will affect entire transportation systems in the long run. Machine learning is nowadays a popular choice to extract and conglomerate knowledge from large amounts of data – and large amounts of data can be obtained about road networks. However, classical machine learning models are incapable of harnessing the graph-structured nature of road networks sufficiently.
Graph neural networks (GNNs) are machine learning models of growing popularity that can explicitly leverage the complex topological structure of node dependencies in graphs, such as the ones observed in road networks. Road networks are sparse graphs that reside in a Euclidean space, and are therefore different from the typical graphs studied in the literature. Also, crowd-sourced road network graphs often have incomplete attributes and generally lack the fine-grained level of detail in their encoded information that would be required for CAVs. Identifying the best representation of road network graphs and complementing their lacking detail with auxiliary data is therefore an important research direction.
This thesis therefore addresses data-driven classification in road networks from two directions: A) the general approach of learning on spatial graphs of road networks with GNNs, and B) complementing road network graphs with auxiliary data. Specifically, this thesis and the included papers address the exemplary task of road classification and make the following contributions to the field:
Paper A analyses how GNNs can be applied to road networks and how the networks are best represented. Different aggregator functions are compared in terms of final classification performance. A novel aggregator and a neighbourhood sampling method are introduced, and the line graph transformation is identified as a suitable representation of road network graphs for GNNs.
Paper B complements the road network graphs with mobility data from millions of GPS trajectories and introduces an equitemporal node spacing to create road segments of equal travel time. It further introduces remote sensing vision data as a potent complement to overcome shortcomings of the graph-based representation of road networks. Simple hand-crafted low-level vision features are used in this work. However, both the equitemporal node spacing and the simple vision features clearly improve classification performance.
Finally, Paper C consolidates the complement of remote sensing data to the road network graphs. Through a general visual feature encoding by state-of-the-art pretrained vision backbones that are carefully fine-tuned to the remote sensing domain, a further performance boost on the road classification task is achieved.
@phdthesis{diva2:1647474,
author = {Stromann, Oliver},
title = {{Data-Driven Classification in Road Networks}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Licentiate Thesis No. 1933}},
year = {2022},
address = {Sweden},
}
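The line-graph representation mentioned for Paper A can be pictured with a toy example (hypothetical intersection names; networkx only; this is an illustration of the general transformation, not the thesis code): road segments, i.e. edges of the road graph, become nodes of the line graph, which is the granularity at which road classification is performed.

import networkx as nx

roads = nx.Graph()
roads.add_edges_from([("A", "B"), ("B", "C"), ("B", "D")])  # intersections A-D
L = nx.line_graph(roads)
# Each node of L is a road segment; two segments are adjacent in L if they
# share an intersection, which is the neighborhood a GNN aggregates over.
print(L.number_of_nodes(), L.number_of_edges())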
Vision is the primary means by which we know where we are, what is nearby, and how we are moving. The corresponding computer vision task is the simultaneous mapping of the surroundings and localization of the camera. This task goes by many names, of which this thesis uses Visual Odometry, a name that implies the images are sequential and emphasizes the accuracy of the pose and the real-time requirements. This field has seen substantial improvements over the past decade, and visual odometry is used extensively in robotics for localization, navigation and obstacle detection.
The main purpose of this thesis is the study and advancement of visual odometry systems, and makes several contributions. The first of which is a high performance stereo visual odometry system, which through geometrically supported tracking achieved top rank on the KITTI odometry benchmark.
The second is a state-of-the-art perspective-three-point solver. Such solvers find the pose of a camera given the projections of three known 3D points and are a core part of many visual odometry systems. By reformulating the underlying problem we avoided a problematic quartic polynomial. As a result we achieved substantially higher computational performance and numerical accuracy.
The third is a system which generalizes stereo visual odometry to the simultaneous estimation of multiple independently moving objects. The main contribution is a real-time system which allows the identification of generic moving rigid objects and the prediction of their trajectories in real time, with applications to robotic navigation in dynamic environments.
The fourth is an improved spline-based continuous pose trajectory estimation framework, which simplifies the integration of general dynamic models. The framework is used to show that visual odometry systems based on continuous pose trajectories are both practical and able to operate in real time.
The visual odometry pipeline is considered from both a theoretical and a practical perspective. The systems described have been tested both on benchmarks and real vehicles. This thesis places the published work into context, highlighting key insights and practical observations.
@phdthesis{diva2:1635583,
author = {Persson, Mikael},
title = {{Visual Odometry in Principle and Practice}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2201}},
year = {2022},
address = {Sweden},
}
Autonomous robots act in a dynamic world where both the robots and other objects may move. The surround sensing systems of such robots therefore work with dynamic input data and need to estimate both the current state of the environment and its dynamics. One of the key elements in obtaining a high-level understanding of the environment is to track dynamic objects. This enables the system to understand what the objects are doing; to predict where they will be in the future; and, in the future, to better estimate where they are. In this thesis, I focus on input from visual cameras, i.e. images. Images have, with the advent of neural networks, become a cornerstone in sensing systems. Image-processing neural networks are optimized to perform a specific computer vision task -- such as recognizing cats and dogs -- on vast datasets of annotated examples. This is usually referred to as offline training and, given a well-designed neural network, enough high-quality data, and a suitable offline training formulation, the neural network is expected to become adept at the specific task.
This thesis starts with a study of object tracking. The tracking is based on the visual appearance of the object, achieved via discriminative convolution filters (DCFs). The first contribution of this thesis is to decompose the filter into multiple subfilters. This serves to increase the robustness during object deformations or rotations. Moreover, it provides a more fine-grained representation of the object state as the subfilters are expected to roughly track object parts. In the second contribution, a neural network is trained directly for object tracking. In order to obtain a fine-grained representation of the object state, it is represented as a segmentation. The main challenge lies in the design of a neural network able to tackle this task. While the common neural networks excel at recognizing patterns seen during offline training, they struggle to store novel patterns in order to later recognize them. To overcome this limitation, a novel appearance learning mechanism is proposed. The mechanism extends the state-of-the-art and is shown to generalize remarkably well to novel data. In the third contribution, the method is used together with a novel fusion strategy and failure detection criterion to semi-automatically annotate visual and thermal videos.
Sensing systems need not only track objects, but also detect them. The fourth contribution of this thesis strives to tackle joint detection, tracking, and segmentation of all objects from a predefined set of object classes. The challenge here lies not only in the neural network design, but also in the design of the offline training formulation. The final approach, a recurrent graph neural network, outperforms prior works that have a runtime of the same order of magnitude.
Last, this thesis studies dynamic learning of novel visual concepts. It is observed that the learning mechanism used for object tracking essentially learns the appearance of the tracked object. It is natural to ask whether this appearance learning could be extended beyond individual objects to entire semantic classes, enabling the system to learn new concepts based on just a few training examples. Such an ability is desirable in autonomous systems as it removes the need to manually annotate thousands of examples of each class that needs recognition. Instead, the system is trained to efficiently learn to recognize new classes. In the fifth contribution, we propose a novel learning mechanism based on Gaussian process regression. With this mechanism, our neural network outperforms the state-of-the-art, and the performance gap is especially large when multiple training examples are given.
To summarize, this thesis studies and makes several contributions to learning systems that parse dynamic visuals and that dynamically learn visual appearances or concepts.
@phdthesis{diva2:1616651,
author = {Johnander, Joakim},
title = {{Dynamic Visual Learning}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2196}},
year = {2022},
address = {Sweden},
}
In computer vision, the aim is to model and extract high-level information from visual sensor measurements such as images, videos and 3D points. Since visual data is often high-dimensional, noisy and irregular, achieving robust data modeling is challenging. This thesis presents works that address challenges within a number of different computer vision problems.
First, the thesis addresses the problem of phase unwrapping for multi-frequency amplitude modulated time-of-flight (ToF) ranging. ToF is used in depth cameras, which have many applications in 3D reconstruction and gesture recognition. While amplitude modulation in time-of-flight ranging can provide accurate measurements for the depth, it also causes depth ambiguities. This thesis presents a method to resolve the ambiguities by estimating the likelihoods of different hypotheses for the depth values. This is achieved by performing kernel density estimation over the hypotheses in a spatial neighborhood of each pixel in the depth image. The depth hypothesis with the highest estimated likelihood can then be selected as the output depth. This approach yields improvements in the quality of the depth images and extends the effective range in both indoor and outdoor environments.
Next, point set registration is investigated, which is the problem of aligning point sets from overlapping depth images or 3D models. Robust registration is fundamental to many vision tasks, such as multi-view 3D reconstruction and object pose estimation for robotics. The thesis presents a method for handling density variations in the measured point sets. This is achieved by modeling a latent distribution representing the underlying structure of the scene. Both the model of the scene and the registration parameters are inferred in an Expectation-Maximization based framework. Secondly, the thesis introduces a method for integrating features from deep neural networks into the registration model. It is shown that the deep features improve registration performance in terms of accuracy and robustness. Additionally, improved feature representations are generated by training the deep neural network end-to-end by minimizing registration errors produced by our registration model.
Further, an approach for 3D point set segmentation is presented. As scene models are often represented using 3D point measurements, segmentation of these is important for general scene understanding. Learning models for segmentation requires a significant amount of annotated data, which is expensive and time-consuming to acquire. The approach presented in the thesis circumvents this by projecting the points into virtual camera views and rendering 2D images. The method can then exploit accurate convolutional neural networks for image segmentation and map the segmentation predictions back to the 3D points. This also allows for transfer learning using available annotated image data, thereby reducing the need for 3D annotations.
Finally, the thesis explores the problem of video object segmentation (VOS), where the task is to track and segment target objects in each frame of a video sequence. Accurate VOS requires a robust model of the target that can adapt to different scenarios and objects. This needs to be achieved using only a single labeled reference frame as training data for each video sequence. To address the challenges in VOS, the thesis introduces a parametric target model, optimized to predict a target label derived from the mask annotation. The target model is integrated into a deep neural network, where its predictions guide a decoder module to produce target segmentation masks. The deep network is trained on labeled video data to output accurate segmentation masks for each frame. Further, it is shown that by training the entire network model in an end-to-end manner, it can learn a representation of the target that provides increased segmentation accuracy.
@phdthesis{diva2:1559711,
author = {Järemo Lawin, Felix},
title = {{Learning Representations for Segmentation and Registration}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2151}},
year = {2021},
address = {Sweden},
}
Early computer vision algorithms operated on dense 2D images captured using conventional monocular or color sensors. Those sensors are passive by nature, provide limited scene representations based on reflected light, and are only able to operate under adequate lighting conditions. These limitations hindered the development of many computer vision algorithms that require some knowledge of the scene structure under varying conditions. The emergence of active sensors such as Time-of-Flight (ToF) cameras contributed to mitigating these limitations; however, they gave rise to many novel challenges, such as data sparsity that stems from multi-path interference, and occlusion.
Many approaches have been proposed to alleviate these challenges by enhancing the acquisition process of ToF cameras or by post-processing their output. Nonetheless, these approaches are sensor and model specific, requiring an individual tuning for each sensor. Alternatively, learning-based approaches, i.e., machine learning, are an attractive solution to these problems by learning a mapping from the original sensor output to a refined version of it. Convolutional Neural Networks (CNNs) are one example of powerful machine learning approaches and they have demonstrated a remarkable success on many computer vision tasks. Unfortunately, CNNs naturally operate on dense data and cannot efficiently handle sparse data from ToF sensors.
In this thesis, we propose a novel variation of CNNs denoted as the Normalized Convolutional Neural Networks that can directly handle sparse data very efficiently. First, we formulate a differentiable normalized convolution layer that takes in sparse data and a confidence map as input. The confidence map provides information about valid and missing pixels to the normalized convolution layer, where the missing values are interpolated from their valid vicinity. Afterwards, we propose a confidence propagation criterion that allows building cascades of normalized convolution layers similar to the standard CNNs. We evaluated our approach on the task of unguided scene depth completion and achieved state-of-the-art results using an exceptionally small network.
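A single normalized-convolution step can be written compactly: the data, weighted by its confidence, is convolved with a non-negative applicability kernel and normalized by the convolved confidence. The sketch below (plain NumPy/SciPy, binomial applicability kernel, simple propagated confidence) is only a minimal illustration of this idea; the trainable, cascaded layers in the thesis differ.

import numpy as np
from scipy.signal import convolve2d

def normalized_convolution(sparse, confidence, kernel, eps=1e-8):
    # One normalized-convolution step: missing values are interpolated from
    # their valid neighborhood, weighted by the confidence map.
    # sparse:     (H, W) data with zeros (or anything) at missing pixels
    # confidence: (H, W) map, e.g. 1 where a measurement exists, 0 elsewhere
    # kernel:     (k, k) non-negative applicability function
    num = convolve2d(sparse * confidence, kernel, mode="same")
    den = convolve2d(confidence, kernel, mode="same")
    dense = num / (den + eps)
    # A simple propagated confidence: normalized sum of confidences under the kernel.
    new_conf = den / kernel.sum()
    return dense, new_conf

# Example: 5x5 binomial applicability on a random, very sparse depth map.
k = np.outer([1, 4, 6, 4, 1], [1, 4, 6, 4, 1]).astype(float)
depth = np.random.rand(64, 64)
conf = (np.random.rand(64, 64) > 0.95).astype(float)  # roughly 5% valid pixels
dense, conf_out = normalized_convolution(depth * conf, conf, k)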
As a second contribution, we investigated the fusion of a normalized convolution network with standard CNNs employing RGB images. We study different fusion schemes, and we provide a thorough analysis for different components of the network. By employing our best fusion strategy, we achieve state-of-the-art results on guided depth completion using a remarkably small network.
Thirdly, to provide a statistical interpretation for confidences, we derive a probabilistic framework for the normalized convolutional neural networks. This framework estimates the input confidence in a self-supervised manner and propagates it to provide a statistically valid output confidence. When compared against existing approaches for uncertainty estimation in CNNs such as Bayesian Deep Learning, our probabilistic framework provides a higher quality measure of uncertainty at a significantly lower computational cost.
Finally, we attempt to employ our framework in a common task in CNNs, namely upsampling. We formulate the upsampling problem as a sparse problem, and we employ the normalized convolutional neural networks to solve it. In comparison to existing approaches, our proposed upsampler is structure-aware while being light-weight. We test our upsampler with various optical flow estimation networks, and we show that it consistently improves the results. When integrated with a recent optical flow network, it sets a new state-of-the-art on the most challenging optical flow dataset.
@phdthesis{diva2:1547851,
author = {Eldesokey, Abdelrahman},
title = {{Uncertainty-Aware Convolutional Neural Networks for Vision Tasks on Sparse Data}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2123}},
year = {2021},
address = {Sweden},
}
In less than ten years, deep neural networks have evolved into all-encompassing tools in multiple areas of science and engineering, due to their almost unreasonable effectiveness in modeling complex real-world relationships. In computer vision in particular, they have taken tasks such as object recognition, that were previously considered very difficult, and transformed them into everyday practical tools. However, neural networks have to be trained with supercomputers on massive datasets for hours or days, and this limits their ability to adjust to changing conditions.
This thesis explores discriminative correlation filters, originally intended for tracking large objects in video, so-called visual object tracking. Unlike neural networks, these filters are small and can be quickly adapted to changes, with minimal data and computing power. At the same time, they can take advantage of the computing infrastructure developed for neural networks and operate within them.
The main contributions in this thesis demonstrate the versatility and adaptability of correlation filters for various problems, while complementing the capabilities of deep neural networks. In the first problem, it is shown that when adapted to track small regions and points, they outperform the widely used Lucas-Kanade method, both in terms of robustness and precision.
In the second problem, the correlation filters take on a completely new task. Here, they are used to tell different places apart, in a 16 by 16 kilometer region of ocean near land. Given only a horizon profile - the coastline silhouette of islands and islets as seen from an ocean vessel - it is demonstrated that discriminative correlation filters can effectively distinguish between locations.
In the third problem, it is shown how correlation filters can be applied to video object segmentation. This is the task of classifying individual pixels as belonging either to a target or the background, given a segmentation mask provided with the first video frame as the only guidance. It is also shown that discriminative correlation filters and deep neural networks complement each other; where the neural network processes the input video in a content-agnostic way, the filters adapt to specific target objects. The joint function is a real-time video object segmentation method.
Finally, the segmentation method is extended beyond binary target/background classification to additionally consider distracting objects. This addresses the fundamental difficulty of coping with objects of similar appearance.
@phdthesis{diva2:1545394,
author = {Robinson, Andreas},
title = {{Discriminative correlation filters in robot vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2146}},
year = {2021},
address = {Sweden},
}
In the last decade, developments in hardware, sensors and software have made it possible to create increasingly autonomous systems. These systems can be as simple as limited driver-assistance software for lane-following in cars, or limited collision-warning systems for otherwise manually piloted drones. On the other end of the spectrum exist fully autonomous cars, boats or helicopters. With increasing abilities to function autonomously, the demands to operate with minimal human supervision in unstructured environments increase accordingly.
Common to most, if not all, autonomous systems is that they require an accurate model of the surrounding world. While a large number of sensors useful for creating such models is currently available, cameras are among the most versatile. From a sensing perspective, cameras have several advantages over other sensors: they require no external infrastructure, are relatively cheap, and can be used to extract information such as the relative positions of other objects and their movements over time, to create accurate maps, and to locate the autonomous system within these maps.
Using cameras to produce a model of the surroundings requires solving a number of technical problems. Often these problems have a basis in recognizing that an object or region of interest is the same over time or in novel viewpoints. In visual tracking this type of recognition is required to follow an object of interest through a sequence of images. In geometric problems it is often a requirement to recognize corresponding image regions in order to perform 3D reconstruction or localization.
The first set of contributions in this thesis is related to the improvement of a class of on-line learned visual object trackers based on discriminative correlation filters. In visual tracking, estimation of the object's size is important for reliable tracking; the first contribution in this part of the thesis investigates this problem. The performance of discriminative correlation filters is highly dependent on what feature representation is used by the filter. The second tracking contribution investigates the performance impact of different features derived from a deep neural network.
A second set of contributions relates to the evaluation of visual object trackers. The first of these is the visual object tracking challenge, a yearly comparison of state-of-the-art visual tracking algorithms. A second contribution is an investigation into the possible issues when using bounding-box representations for ground-truth data.
In real-world settings, tracking typically occurs over longer time sequences than is common in benchmarking datasets. In such settings it is common that the model updates of many tracking algorithms cause the tracker to fail silently. For this reason it is important to have an estimate of the tracker's performance even in cases when no ground-truth annotations exist. The first of the final three contributions investigates this problem in a robotics setting, by fusing information from a pre-trained object detector in a state-estimation framework. An additional contribution describes how to dynamically re-weight the data used for the appearance model of a tracker. A final contribution investigates how to obtain an estimate of how certain detections are in a setting where geometrical limitations can be imposed on the search region. The proposed solution learns to accurately predict stereo disparities along with accurate assessments of each prediction's certainty.
@phdthesis{diva2:1545918,
author = {Häger, Gustav},
title = {{Learning visual perception for autonomous systems}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2138}},
year = {2021},
address = {Sweden},
}
Thermal cameras have historically been of interest mainly for military applications. Increasing image quality and resolution combined with decreasing camera price and size during recent years have, however, opened up new application areas. They are now widely used for civilian applications, e.g., within industry, to search for missing persons, in automotive safety, as well as for medical applications. Thermal cameras are useful as soon as there exists a measurable temperature difference. Compared to cameras operating in the visual spectrum, they are advantageous due to their ability to see in total darkness, robustness to illumination variations, and less intrusion on privacy.
This thesis addresses the problem of automatic image analysis in thermal infrared images with a focus on machine learning methods. The main purpose of this thesis is to study the variations of processing required due to the thermal infrared data modality. In particular, three different problems are addressed: visual object tracking, anomaly detection, and modality transfer. All these are research areas that have been and currently are subject to extensive research. Furthermore, they are all highly relevant for a number of different real-world applications.
The first addressed problem is visual object tracking, a problem for which no prior information other than the initial location of the object is given. The main contribution concerns benchmarking of short-term single-object (STSO) visual object tracking methods in thermal infrared images. The proposed dataset, LTIR (Linköping Thermal Infrared), was integrated in the VOT-TIR2015 challenge, introducing the first ever organized challenge on STSO tracking in thermal infrared video. Another contribution also related to benchmarking is a novel, recursive, method for semi-automatic annotation of multi-modal video sequences. Based on only a few initial annotations, a video object segmentation (VOS) method proposes segmentations for all remaining frames, and difficult parts in need of additional manual annotation are automatically detected. The third contribution to the problem of visual object tracking is a template tracking method based on a non-parametric probability density model of the object's thermal radiation using channel representations.
The second addressed problem is anomaly detection, i.e., detection of rare objects or events. The main contribution is a method for truly unsupervised anomaly detection based on Generative Adversarial Networks (GANs). The method employs joint training of the generator and an observation-to-latent-space encoder, enabling stratification of the latent space and, thus, also separation of normal and anomalous samples. The second contribution is the previously unaddressed problem of obstacle detection in front of moving trains using a train-mounted thermal camera. Adaptive correlation filters are updated continuously, and missed detections of background are treated as detections of anomalies, or obstacles. The third contribution to the problem of anomaly detection is a method for characterization and classification of automatically detected district heat leakages for the purpose of false alarm reduction.
Finally, the thesis addresses the problem of modality transfer between thermal infrared and visual spectrum images, a previously unaddressed problem. The contribution is a method based on Convolutional Neural Networks (CNNs), enabling perceptually realistic transformations of thermal infrared to visual images. By careful design of the loss function, the method becomes robust to image pair misalignments. The method exploits the human visual system's lower acuity for color differences than for luminance, separating the loss into a luminance and a chrominance part.
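The luminance/chrominance separation can be illustrated with a small sketch that converts RGB to Y'CbCr (standard BT.601 matrix) and computes separate L1 terms. The relative weighting of the two terms is an assumption for the example, not the loss used in the thesis.

import numpy as np

# BT.601 RGB -> Y'CbCr conversion matrix (RGB values assumed in [0, 1]).
_M = np.array([[ 0.299,  0.587,  0.114],
               [-0.169, -0.331,  0.500],
               [ 0.500, -0.419, -0.081]])

def rgb_to_ycbcr(img):
    ycc = img @ _M.T
    ycc[..., 1:] += 0.5  # shift chroma channels to [0, 1]
    return ycc

def lum_chrom_loss(pred_rgb, target_rgb, w_lum=1.0, w_chrom=0.5):
    # L1 loss split into a luminance term and a chrominance term.
    p, t = rgb_to_ycbcr(pred_rgb), rgb_to_ycbcr(target_rgb)
    lum = np.abs(p[..., 0] - t[..., 0]).mean()
    chrom = np.abs(p[..., 1:] - t[..., 1:]).mean()
    return w_lum * lum + w_chrom * chrom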
@phdthesis{diva2:1365154,
author = {Berg, Amanda},
title = {{Learning to Analyze what is Beyond the Visible Spectrum}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 2024}},
year = {2019},
address = {Sweden},
}
Over the last decade, the usage of unmanned systems such as Unmanned Aerial Vehicles (UAVs), Unmanned Surface Vessels (USVs) and Unmanned Ground Vehicles (UGVs) has increased drastically, and there is still a rapid growth. Today, unmanned systems are being deployed in many daily operations, e.g. for deliveries in remote areas, to increase efficiency of agriculture, and for environmental monitoring at sea. For safety reasons, unmanned systems are often the preferred choice for surveillance missions in hazardous environments, e.g. for detection of nuclear radiation, and in disaster areas after earthquakes, hurricanes, or during forest fires. For safe navigation of the unmanned systems during their missions, continuous and accurate global localization and attitude estimation is mandatory.
Over the years, many vision-based methods for position estimation have been developed, primarily for urban areas. In contrast, this thesis is mainly focused on vision-based methods for accurate position and attitude estimates in natural environments, i.e. beyond the urban areas. Vision-based methods possess several characteristics that make them appealing as global position and attitude sensors. First, vision sensors can be realized and tailored for most unmanned vehicle applications. Second, geo-referenced terrain models can be generated worldwide from satellite imagery and can be stored onboard the vehicles. In natural environments, where the availability of geo-referenced images in general is low, registration of image information with terrain models is the natural choice for position and attitude estimation. This is the problem area that I addressed in the contributions of this thesis.
The first contribution is a method for full 6DoF (degrees of freedom) pose estimation from aerial images. A dense local height map is computed using structure from motion. The global pose is inferred from the 3D similarity transform between the local height map and a digital elevation model. Aligning height information is assumed to be more robust to season variations than feature-based matching.
The second contribution is a method for accurate attitude (pitch and roll angle) estimation via horizon detection. It is one of only a few methods that use an omnidirectional (fisheye) camera for horizon detection in aerial images. The method is based on edge detection and a probabilistic Hough voting scheme. The method allows prior knowledge of the attitude angles to be exploited to make the initial attitude estimates more robust. The estimates are then refined through registration with the geometrically expected horizon line from a digital elevation model. To the best of our knowledge, it is the first method where the ray refraction in the atmosphere is taken into account, which enables highly accurate attitude estimates.
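To make the edge-detection-plus-Hough-voting idea concrete, here is a plain OpenCV sketch that finds the dominant straight line in a grayscale image (the file name is hypothetical). It ignores the fisheye geometry, the probabilistic attitude prior, and the atmospheric refraction that the actual method accounts for.

import cv2
import numpy as np

def dominant_line(gray):
    # Return (rho, theta) of the strongest straight line found by Hough voting.
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)
    if lines is None:
        return None
    rho, theta = lines[0][0]  # first returned line (highest-voted in OpenCV)
    return rho, theta

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical example image
if img is not None:
    print(dominant_line(img))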
The third contribution is a method for position estimation based on horizon detection in an omnidirectional panoramic image around a surface vessel. Two convolutional neural networks (CNNs) are designed and trained to estimate the camera orientation and to segment the horizon line in the image. The MOSSE correlation filter, normally used in visual object tracking, is adapted to horizon line registration with geometric data from a digital elevation model. Comprehensive field trials conducted in the archipelago demonstrate the GPS-level accuracy of the method, and that the method can be trained on images from one region and then applied to images from a previously unvisited test area.
The CNNs in the third contribution apply the typical scheme of convolutions, activations, and pooling. The fourth contribution focuses on the activations and suggests a new formulation to tune and optimize a piecewise linear activation function during training of CNNs. Improved classification results from experiments when tuning the activation function led to the introduction of a new activation function, the Shifted Exponential Linear Unit (ShELU).
@phdthesis{diva2:1303454,
author = {Grelsson, Bertil},
title = {{Vision-based Localization and Attitude Estimation Methods in Natural Environments}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1977}},
year = {2019},
address = {Sweden},
}
Using images to reconstruct the world in three dimensions is a classical computer vision task. Some examples of applications where this is useful are autonomous mapping and navigation, urban planning, and special effects in movies. One common approach to 3D reconstruction is ”structure from motion” where a scene is imaged multiple times from different positions, e.g. by moving the camera. However, in a twist of irony, many structure from motion methods work best when the camera is stationary while the image is captured. This is because the motion of the camera can cause distortions in the image that lead to worse image measurements, and thus a worse reconstruction. One such distortion common to all cameras is motion blur, while another is connected to the use of an electronic rolling shutter. Instead of capturing all pixels of the image at once, a camera with a rolling shutter captures the image row by row. If the camera is moving while the image is captured the rolling shutter causes non-rigid distortions in the image that, unless handled, can severely impact the reconstruction quality.
This thesis studies methods to robustly perform 3D reconstruction in the case of a moving camera. To do so, the proposed methods make use of an inertial measurement unit (IMU). The IMU measures the angular velocities and linear accelerations of the camera, and these can be used to estimate the trajectory of the camera over time. Knowledge of the camera motion can then be used to correct for the distortions caused by the rolling shutter. Another benefit of an IMU is that it can provide measurements also in situations when a camera can not, e.g. because of excessive motion blur, or absence of scene structure.
To use a camera together with an IMU, the camera-IMU system must be jointly calibrated. The relationship between their respective coordinate frames need to be established, and their timings need to be synchronized. This thesis shows how to automatically perform this calibration and synchronization, without requiring e.g. calibration objects or special motion patterns.
In standard structure from motion, the camera trajectory is modeled as discrete poses, with one pose per image. Switching instead to a formulation with a continuous-time camera trajectory provides a natural way to handle rolling shutter distortions, and also to incorporate inertial measurements. To model the continuous-time trajectory, many authors have used splines. The ability of a spline-based trajectory to model the real motion depends on the density of its spline knots. Choosing too smooth a spline results in approximation errors. This thesis proposes a method to estimate the spline approximation error, and use it to better balance camera and IMU measurements, when used in a sensor fusion framework. Also proposed is a way to automatically decide how dense the spline needs to be to achieve a good reconstruction.
Another approach to reconstruct a 3D scene is to use a camera that directly measures depth. Some depth cameras, like the well-known Microsoft Kinect, are susceptible to the same rolling shutter effects as normal cameras. This thesis quantifies the effect of the rolling shutter distortion on 3D reconstruction, depending on the amount of motion. It is also shown that a better 3D model is obtained if the depth images are corrected using inertial measurements.
@phdthesis{diva2:1220622,
author = {Ovr\'{e}n, Hannes},
title = {{Continuous Models for Cameras and Inertial Sensors}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1951}},
year = {2018},
address = {Sweden},
}
Visual tracking is one of the fundamental problems in computer vision. Its numerous applications include robotics, autonomous driving, augmented reality and 3D reconstruction. In essence, visual tracking can be described as the problem of estimating the trajectory of a target in a sequence of images. The target can be any image region or object of interest. While humans excel at this task, requiring little effort to perform accurate and robust visual tracking, it has proven difficult to automate. It has therefore remained one of the most active research topics in computer vision.
In its most general form, no prior knowledge about the object of interest or environment is given, except for the initial target location. This general form of tracking is known as generic visual tracking. The unconstrained nature of this problem makes it particularly difficult, yet applicable to a wider range of scenarios. As no prior knowledge is given, the tracker must learn an appearance model of the target on-the-fly. Cast as a machine learning problem, it imposes several major challenges which are addressed in this thesis.
The main purpose of this thesis is the study and advancement of the, so called, Discriminative Correlation Filter (DCF) framework, as it has shown to be particularly suitable for the tracking application. By utilizing properties of the Fourier transform, a correlation filter is discriminatively learned by efficiently minimizing a least-squares objective. The resulting filter is then applied to a new image in order to estimate the target location.
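The core DCF learning step alluded to above can be written in a few lines: with a desired Gaussian response, the least-squares filter has a closed-form solution in the Fourier domain (the classic single-channel MOSSE formulation). This sketch is only the starting point; the contributions summarized below (spatial regularization, sample management, continuous-space formulation) extend it substantially.

import numpy as np

def learn_dcf(patch, sigma=2.0, lam=1e-2):
    # Closed-form single-channel correlation filter (MOSSE-style).
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Desired Gaussian response centered on the target.
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    G, F = np.fft.fft2(g), np.fft.fft2(patch)
    H_conj = (G * np.conj(F)) / (F * np.conj(F) + lam)  # filter in the Fourier domain
    return H_conj

def detect(H_conj, patch):
    # Correlate the filter with a new patch and return the response peak location.
    resp = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(resp), resp.shape)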
This thesis contributes to the advancement of the DCF methodology in several aspects. The main contribution regards the learning of the appearance model: First, the problem of updating the appearance model with new training samples is covered. Efficient update rules and numerical solvers are investigated for this task. Second, the periodic assumption induced by the circular convolution in DCF is countered by proposing a spatial regularization component. Third, an adaptive model of the training set is proposed to alleviate the impact of corrupted or mislabeled training samples. Fourth, a continuous-space formulation of the DCF is introduced, enabling the fusion of multiresolution features and sub-pixel accurate predictions. Finally, the problems of computational complexity and overfitting are addressed by investigating dimensionality reduction techniques.
As a second contribution, different feature representations for tracking are investigated. A particular focus is put on the analysis of color features, which had been largely overlooked in prior tracking research. This thesis also studies the use of deep features in DCF-based tracking. While many vision problems have greatly benefited from the advent of deep learning, it has proven difficult to harvest the power of such representations for tracking. In this thesis it is shown that both shallow and deep layers contribute positively. Furthermore, the problem of fusing their complementary properties is investigated.
The final major contribution of this thesis regards the prediction of the target scale. In many applications, it is essential to track the scale, or size, of the target since it is strongly related to the relative distance. A thorough analysis of how to integrate scale estimation into the DCF framework is performed. A one-dimensional scale filter is proposed, enabling efficient and accurate scale estimation.
@phdthesis{diva2:1201230,
author = {Danelljan, Martin},
title = {{Learning Convolution Operators for Visual Tracking}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1926}},
year = {2018},
address = {Sweden},
}
Object recognition is a skill we as humans often take for granted. Due to our formidable object learning, recognition and generalisation skills, it is sometimes hard to see the multitude of obstacles that need to be overcome in order to replicate this skill in an artificial system. Object recognition is also one of the classical areas of computer vision, and many ways of approaching the problem have been proposed. Recently, visually capable robots and autonomous vehicles have increased the focus on embodied recognition systems and active visual search. These applications demand that systems can learn and adapt to their surroundings, and arrive at decisions in a reasonable amount of time, while maintaining high object recognition performance. This is especially challenging due to the high dimensionality of image data. In cases where end-to-end learning from pixels to output is needed, mechanisms designed to make inputs tractable are often necessary for less computationally capable embodied systems.
Active visual search also means that mechanisms for attention and gaze control are integral to the object recognition procedure. Therefore, the way in which attention mechanisms should be introduced into feature extraction and estimation algorithms must be carefully considered when constructing a recognition system.
This thesis describes work done on the components necessary for creating an embodied recognition system, specifically in the areas of decision uncertainty estimation, object segmentation from multiple cues, adaptation of stereo vision to a specific platform and setting, problem-specific feature selection, efficient estimator training and attentional modulation in convolutional neural networks. Contributions include the evaluation of methods and measures for predicting the potential uncertainty reduction that can be obtained from additional views of an object, allowing for adaptive target observations. Also, in order to separate a specific object from other parts of a scene, it is often necessary to combine multiple cues such as colour and depth in order to obtain satisfactory results. Therefore, a method for combining these using channel coding has been evaluated. In order to make use of three-dimensional spatial structure in recognition, a novel stereo vision algorithm extension along with a framework for automatic stereo tuning have also been investigated. Feature selection and efficient discriminant sampling for decision tree-based estimators have also been implemented. Finally, attentional multi-layer modulation of convolutional neural networks for recognition in cluttered scenes has been evaluated. Several of these components have been tested and evaluated on a purpose-built embodied recognition platform known as Eddie the Embodied.
@phdthesis{diva2:1049161,
author = {Wallenberg, Marcus},
title = {{Embodied Visual Object Recognition}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1811}},
year = {2017},
address = {Sweden},
}
Driver assistance systems in modern cars now show clear steps towards autonomous driving, and improvements are presented at a steady pace. The total number of sensors has also decreased compared to the vehicles of the initial DARPA challenge, which more resembled a pile of sensors with a car underneath. Still, anyone driving a tele-operated toy using a video link is a demonstration that a single camera provides enough information about the surrounding world.
Most lane assist systems are developed for highway use and depend on visible lane markers. However, lane markers may not be visible due to snow or wear, and there are roads without lane markers. With a slightly different approach, autonomous road following can be obtained on almost any kind of road. Using realtime online machine learning, a human driver can demonstrate driving on a road type unknown to the system and after some training, the system can seamlessly take over. The demonstrator system presented in this work has shown capability of learning to follow different types of roads as well as learning to follow a person. The system is based solely on vision, mapping camera images directly to control signals.
Such systems need the ability to handle multiple-hypothesis outputs as there may be several plausible options in similar situations. If there is an obstacle in the middle of the road, the obstacle can be avoided by going on either side. However the average action, going straight ahead, is not a viable option. Similarly, at an intersection, the system should follow one road, not the average of all roads.
To this end, an online machine learning framework is presented where inputs and outputs are represented using the channel representation. The learning system is structurally simple and computationally light, based on neuropsychological ideas presented by Donald Hebb over 60 years ago. Nonetheless, the system has shown a capability to learn advanced tasks. Furthermore, the structure of the system permits a statistical interpretation where a non-parametric representation of the joint distribution of input and output is generated. Prediction generates the conditional distribution of the output, given the input.
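For readers unfamiliar with the channel representation, the sketch below encodes a scalar with overlapping cos^2 basis functions, which is one common choice of channel basis; the number of channels and the value range are arbitrary example parameters.

import numpy as np

def channel_encode(x, n_channels=8, lo=0.0, hi=1.0):
    # Encode a scalar x in [lo, hi] as a vector of overlapping cos^2 channels.
    # Basis: B(d) = cos^2(pi*d/3) for |d| < 1.5, zero otherwise (unit channel spacing).
    centers = np.arange(1, n_channels + 1)
    xc = 1.0 + (x - lo) / (hi - lo) * (n_channels - 1)  # map x to channel coordinates
    d = xc - centers
    return np.where(np.abs(d) < 1.5, np.cos(np.pi * d / 3.0) ** 2, 0.0)

print(channel_encode(0.37))  # soft, sparse encoding: at most three non-zero channels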
The statistical interpretation motivates the introduction of priors. In cases with multiple options, such as at intersections, a prior can select one mode in the multimodal distribution of possible actions. In addition to the ability to learn from demonstration, a possibility for immediate reinforcement feedback is presented. This allows for a system where the teacher can choose the most appropriate way of training the system, at any time and at her own discretion.
The theoretical contributions include a deeper analysis of the channel representation. A geometrical analysis illustrates the cause of decoding bias commonly present in neurologically inspired representations, and measures to counteract it. Confidence values are analyzed and interpreted as evidence and coherence. Further, the use of the truncated cosine basis function is motivated.
Finally, a selection of applications is presented, such as autonomous road following by online learning and head pose estimation. A method founded on the same basic principles is used for visual tracking, where the probabilistic representation of target pixel values allows for changes in target appearance.
@phdthesis{diva2:916645,
author = {Öfjäll, Kristoffer},
title = {{Adaptive Supervision Online Learning for Vision Based Autonomous Systems}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1749}},
year = {2016},
address = {Sweden},
}
Thermal cameras have historically been of interest mainly for military applications. Increasing image quality and resolution combined with decreasing price and size during recent years have, however, opened up new application areas. They are now widely used for civilian applications, e.g., within industry, to search for missing persons, in automotive safety, as well as for medical applications. Thermal cameras are useful as soon as it is possible to measure a temperature difference. Compared to cameras operating in the visual spectrum, they are advantageous due to their ability to see in total darkness, robustness to illumination variations, and less intrusion on privacy.
This thesis addresses the problem of detection and tracking in thermal infrared imagery. Visual detection and tracking of objects in video are research areas that have been and currently are subject to extensive research. Indications of their popularity are recent benchmarks such as the annual Visual Object Tracking (VOT) challenges, the Object Tracking Benchmarks, the series of workshops on Performance Evaluation of Tracking and Surveillance (PETS), and the workshops on Change Detection. Benchmark results indicate that detection and tracking are still challenging problems.
A common belief is that detection and tracking in thermal infrared imagery is identical to detection and tracking in grayscale visual imagery. This thesis argues that the preceding allegation is not true. The characteristics of thermal infrared radiation and imagery pose certain challenges to image analysis algorithms. The thesis describes these characteristics and challenges as well as presents evaluation results confirming the hypothesis.
Detection and tracking are often treated as two separate problems. However, some tracking methods, e.g. template-based tracking methods, base their tracking on repeated specific detections. They learn a model of the object that is adaptively updated. That is, detection and tracking are performed jointly. The thesis includes a template-based tracking method designed specifically for thermal infrared imagery, describes a thermal infrared dataset for evaluation of template-based tracking methods, and provides an overview of the first challenge on short-term, single-object tracking in thermal infrared video. Finally, two applications employing detection and tracking methods are presented.
@phdthesis{diva2:918038,
author = {Berg, Amanda},
title = {{Detection and Tracking in Thermal Infrared Imagery}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1744}},
year = {2016},
address = {Sweden},
}
This dissertation addresses the problem of adaptive image filtering.
Although the topic has a long history in the image processing community, researchers continuously present novel methods to obtain ever better image restoration results.
With an expanding market for individuals who wish to share their everyday life on social media, imaging devices such as compact cameras and smartphones are important factors. Naturally, every producer of imaging equipment desires to exploit cheap camera components while supplying high-quality images. One step in this pipeline is to use sophisticated imaging software, including, e.g., noise reduction, to reduce manufacturing costs while maintaining image quality.
This thesis is based on traditional formulations such as isotropic and tensor-based anisotropic diffusion for image denoising. The difference from mainstream denoising methods is that this thesis explores the effects of introducing contextual information as prior knowledge for image denoising into the filtering schemes. To achieve this, the adaptive filtering theory is formulated from an energy minimization standpoint. The core contribution of this work is the introduction of a novel tensor-based functional which unifies and generalises standard diffusion methods. Additionally, the explicit Euler-Lagrange equation is derived which, if solved, yields the stationary point for the minimization problem. Several aspects of the functional are presented in detail, including, but not limited to, tensor symmetry constraints and convexity. Also, the classical problem of finding a variational formulation to a given tensor-based partial differential equation is studied.
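As background for the diffusion formulations discussed here, the sketch below implements a standard Perona-Malik style isotropic nonlinear diffusion step (periodic boundaries and an exponential edge-stopping function chosen for the example). The tensor-based functional introduced in the thesis generalizes this scheme and is not reproduced here.

import numpy as np

def nonlinear_diffusion(img, n_iter=50, kappa=0.1, dt=0.2):
    # Perona-Malik style diffusion: smooth homogeneous regions, preserve edges.
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)  # edge-stopping function g(|grad|)
    for _ in range(n_iter):
        # Differences towards the four neighbours (np.roll gives periodic boundaries).
        dn = np.roll(u, -1, 0) - u
        ds = np.roll(u, 1, 0) - u
        de = np.roll(u, -1, 1) - u
        dw = np.roll(u, 1, 1) - u
        u += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u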
The presented framework is applied to problem formulations that include non-linear domain transformations, e.g., visualization of medical images.
Additionally, the framework is also used to exploit locally estimated probability density functions or the channel representation to drive the filtering process.
Furthermore, one of the first truly tensor-based formulations of total variation is presented. The key to the formulation is the gradient energy tensor, which does not require spatial regularization of its tensor components. It is shown empirically in several computer vision applications, such as corner detection and optical flow, that the gradient energy tensor is a viable replacement for the commonly used structure tensor. Moreover, the gradient energy tensor is used in the traditional tensor-based anisotropic diffusion scheme. This approach results in significant improvements in computational speed when the scheme is implemented on a graphical processing unit compared to using the commonly used structure tensor.
@phdthesis{diva2:789680,
author = {Åström, Freddie},
title = {{Variational Tensor-Based Models for Image Diffusion in Non-Linear Domains}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1646}},
year = {2015},
address = {Sweden},
}
In tele-operated robotics applications, the primary information channel from the robot to its human operator is a video stream. For autonomous robotic systems however, a much larger selection of sensors is employed, although the most relevant information for the operation of the robot is still available in a single video stream. The issue lies in autonomously interpreting the visual data and extracting the relevant information, something humans and animals perform strikingly well. On the other hand, humans have great difficulty expressing what they are actually looking for on a low level, suitable for direct implementation on a machine. For instance objects tend to be already detected when the visual information reaches the conscious mind, with almost no clues remaining regarding how the object was identified in the first place. This became apparent already when Seymour Papert gathered a group of summer workers to solve the computer vision problem 48 years ago [35].
Artificial learning systems can overcome this gap between the level of human visual reasoning and low-level machine vision processing. If a human teacher can provide examples of what is to be extracted and if the learning system is able to extract the gist of these examples, the gap is bridged. There are however some special demands on a learning system for it to perform successfully in a visual context. First, low level visual input is often of high dimensionality such that the learning system needs to handle large inputs. Second, visual information is often ambiguous such that the learning system needs to be able to handle multi modal outputs, i.e. multiple hypotheses. Typically, the relations to be learned are non-linear and there is an advantage if data can be processed at video rate, even after presenting many examples to the learning system. In general, there seems to be a lack of such methods.
This thesis presents systems for learning perception-action mappings for robotic systems with visual input. A range of problems are discussed, such as vision based autonomous driving, inverse kinematics of a robotic manipulator and controlling a dynamical system. Operational systems demonstrating solutions to these problems are presented. Two different approaches for providing training data are explored, learning from demonstration (supervised learning) and explorative learning (self-supervised learning). A novel learning method fulfilling the stated demands is presented. The method, qHebb, is based on associative Hebbian learning on data in channel representation. Properties of the method are demonstrated on a vision-based autonomously driving vehicle, where the system learns to directly map low-level image features to control signals. After an initial training period, the system seamlessly continues autonomously. In a quantitative evaluation, the proposed online learning method performed comparably with state of the art batch learning methods.
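A minimal sketch of the general structure, associative Hebbian learning between channel-encoded input and output vectors via an outer-product update of a linkage matrix, is given below. The actual qHebb update rule differs; this is only meant to convey the flavour of the approach.

import numpy as np

class HebbianAssociator:
    # Associative mapping from input channel vectors to output channel vectors.
    def __init__(self, n_in, n_out, lr=0.05):
        self.C = np.zeros((n_out, n_in))  # linkage matrix
        self.lr = lr

    def train(self, a_in, a_out):
        # Hebbian outer-product update: strengthen co-active channel pairs.
        self.C += self.lr * np.outer(a_out, a_in)

    def predict(self, a_in):
        # The product approximates an (unnormalized) conditional output distribution.
        resp = self.C @ a_in
        return resp / (resp.sum() + 1e-12)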
@phdthesis{diva2:750053,
author = {Öfjäll, Kristoffer},
title = {{Online Learning for Robot Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1678}},
year = {2014},
address = {Sweden},
}
Almost all cell-phones and camcorders sold today are equipped with a CMOS (Complementary Metal Oxide Semiconductor) image sensor and there is also a general trend to incorporate CMOS sensors in other types of cameras. The CMOS sensor has many advantages over the more conventional CCD (Charge-Coupled Device) sensor such as lower power consumption, cheaper manufacturing and the potential for on-chip processing. Nearly all CMOS sensors make use of what is called a rolling shutter readout. Unlike a global shutter readout, which images all the pixels at the same time, a rolling-shutter exposes the image row-by-row. If a mechanical shutter is not used this will lead to geometric distortions in the image when either the camera or the objects in the scene are moving. Smaller cameras, like those in cell-phones, do not have mechanical shutters and systems that do have them will not use them when recording video. The result will look wobbly (jello effect), skewed or otherwise strange and this is often not desirable. In addition, many computer vision algorithms assume that the camera used has a global shutter and will break down if the distortions are too severe.
In airborne remote sensing it is common to use push-broom sensors. These sensors exhibit a similar kind of distortion as that of a rolling-shutter camera, due to the motion of the aircraft. If the acquired images are to be registered to maps or other images, the distortions need to be suppressed.
The main contributions in this thesis are the development of the three-dimensional models for rolling-shutter distortion correction. Previous attempts modelled the distortions as taking place in the image plane, and we have shown that our techniques give better results for hand-held camera motions. The basic idea is to estimate the camera motion, not only between frames, but also the motion during frame capture. The motion is estimated using image correspondences and with these a non-linear optimisation problem is formulated and solved. All rows in the rolling-shutter image are imaged at different times, and when the motion is known, each row can be transformed to its rectified position. The same is true when using depth sensors such as the Microsoft Kinect, and the thesis describes how to estimate its 3D motion and how to rectify 3D point clouds.
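The row-wise rectification idea can be sketched as follows: assuming a pure-rotation camera model, known intrinsics K, and an already estimated rotation per image row, each row is warped to the pose of a reference row through the homography K R_ref R_row^T K^-1. The forward-mapping resampling and nearest-neighbour rounding are simplifications for the example, and the motion estimation itself is omitted.

import numpy as np

def rectify_rolling_shutter(img, K, row_rotations, ref_row=0):
    # img:           (H, W) image captured with a rolling shutter
    # K:             (3, 3) camera intrinsics
    # row_rotations: list of (3, 3) camera rotations, one per image row
    # Each pixel is moved to where it would have been imaged at the reference row's pose.
    H, W = img.shape
    Kinv = np.linalg.inv(K)
    out = np.zeros_like(img)
    R_ref = row_rotations[ref_row]
    for y in range(H):
        # Homography taking pixels of row y to the reference pose (pure rotation model).
        Hy = K @ R_ref @ row_rotations[y].T @ Kinv
        xs = np.stack([np.arange(W), np.full(W, y), np.ones(W)])
        p = Hy @ xs
        u = np.round(p[0] / p[2]).astype(int)
        v = np.round(p[1] / p[2]).astype(int)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        out[v[ok], u[ok]] = img[y, np.arange(W)[ok]]
    return out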
In the thesis it has also been explored how to use similar techniques as for the rolling-shutter case, to correct push-broom images. When a transformation has been found, the images need to be resampled to a regular grid in order to be visualised. This can be done in many ways and different methods have been tested and adapted to the push-broom setup.
In addition to rolling-shutter distortions, hand-held footage often has shaky camera motion. It is possible to do efficient video stabilisation in combination with the rectification using rotation smoothing. Apart from these distortions, motion blur is a big problem for hand-held photography. The images will be blurry due to the camera motion and also noisy if taken in low light conditions. One of the contributions in the thesis is a method which uses gyroscope measurements and feature tracking to combine several images, taken with a smartphone, into one resulting image with less blur and noise. This enables the user to take photos which would have otherwise required a tripod.
@phdthesis{diva2:742702,
author = {Ringaby, Erik},
title = {{Geometric Models for Rolling-shutter and Push-broom Sensors}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1615}},
year = {2014},
address = {Sweden},
}
Over the last decade, the use of unmanned aerial vehicles (UAVs) has increased drastically. Originally, the use of these aircraft was mainly military, but today many civil applications have emerged. UAVs are frequently the preferred choice for surveillance missions in disaster areas, after earthquakes or hurricanes, and in hazardous environments, e.g. for detection of nuclear radiation. The UAVs employed in these missions are often relatively small in size which implies payload restrictions.
For navigation of the UAVs, continuous global pose (position and attitude) estimation is mandatory. Cameras can be fabricated both small in size and light in weight. This makes vision-based methods well suited for pose estimation onboard these vehicles. It is obvious that no single method can be used for pose estimation in all different phases throughout a flight. The image content will be very different on the runway, during ascent, during flight at low or high altitude, above urban or rural areas, etc. In total, a multitude of pose estimation methods is required to handle all these situations. Over the years, a large number of vision-based pose estimation methods for aerial images have been developed. But there are still open research areas within this field, e.g. the use of omnidirectional images for pose estimation is relatively unexplored.
The contributions of this thesis are three vision-based methods for global ego-positioning and/or attitude estimation from aerial images. The first method for full 6DoF (degrees of freedom) pose estimation is based on registration of local height information with a geo-referenced 3D model. A dense local height map is computed using motion stereo. A pose estimate from navigation sensors is used as an initialization. The global pose is inferred from the 3D similarity transform between the local height map and the 3D model. Aligning height information is assumed to be more robust to season variations than feature matching in a single-view based approach.
The second contribution is a method for attitude (pitch and roll angle) estimation via horizon detection. It is one of only a few methods in the literature that use an omnidirectional (fisheye) camera for horizon detection in aerial images. The method is based on edge detection and a probabilistic Hough voting scheme. In a flight scenario, there is often some knowledge on the probability density for the altitude and the attitude angles. The proposed method allows this prior information to be used to make the attitude estimation more robust.
The third contribution is a further development of method two. It is the very first method presented where the attitude estimates from the detected horizon in omnidirectional images are refined through registration with the geometrically expected horizon from a digital elevation model. It is one of few methods where the ray refraction in the atmosphere is taken into account, which contributes to the highly accurate pose estimates. The attitude errors obtained are about one order of magnitude smaller than for any previous vision-based method for attitude estimation from horizon detection in aerial images.
@phdthesis{diva2:729563,
author = {Grelsson, Bertil},
title = {{Global Pose Estimation from Aerial Images:
Registration with Elevation Models}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1672}},
year = {2014},
address = {Sweden},
}
Object recognition is a skill we as humans often take for granted. Due to our formidable object learning, recognition and generalisation skills, it is sometimes hard to see the multitude of obstacles that need to be overcome in order to replicate this skill in an artificial system. Object recognition is also one of the classical areas of computer vision, and many ways of approaching the problem have been proposed. Recently, visually capable robots and autonomous vehicles have increased the focus on embodied recognition systems and active visual search. These applications demand that systems can learn and adapt to their surroundings, and arrive at decisions in a reasonable amount of time, while maintaining high object recognition performance. Active visual search also means that mechanisms for attention and gaze control are integral to the object recognition procedure. This thesis describes work done on the components necessary for creating an embodied recognition system, specifically in the areas of decision uncertainty estimation, object segmentation from multiple cues, adaptation of stereo vision to a specific platform and setting, and the implementation of the system itself. Contributions include the evaluation of methods and measures for predicting the potential uncertainty reduction that can be obtained from additional views of an object, allowing for adaptive target observations. Also, in order to separate a specific object from other parts of a scene, it is often necessary to combine multiple cues such as colour and depth in order to obtain satisfactory results. Therefore, a method for combining these using channel coding has been evaluated. Finally, in order to make use of three-dimensional spatial structure in recognition, a novel stereo vision algorithm extension along with a framework for automatic stereo tuning have also been investigated. All of these components have been tested and evaluated on a purpose-built embodied recognition platform known as Eddie the Embodied.
@phdthesis{diva2:634701,
author = {Wallenberg, Marcus},
title = {{Components of Embodied Visual Object Recognition:
Object Perception and Learning on a Robotic Platform}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1607}},
year = {2013},
address = {Sweden},
}
Image filtering methods are designed to enhance noisy images captured in situations that are problematic for the camera sensor. Such noisy images originate from unfavourable illumination conditions, camera motion, or the desire to use only a low dose of ionising radiation in medical imaging. Therefore, in this thesis work I have investigated the theory of partial differential equations (PDE) to design filtering methods that attempt to remove noise from images. This is achieved by modeling and deriving energy functionals which in turn are minimized to attain a state of minimum energy. This state is obtained by solving the so-called Euler-Lagrange equation. An important theoretical contribution of this work is that conditions are put forward determining when a PDE has a corresponding energy functional. This is in particular described in the case of the structure tensor, a commonly used tensor in computer vision.
A primary component of this thesis work is to model adaptive image filtering such that any modification of the image is structure preserving, yet noise suppressing. In color image filtering this is a particular challenge, since artifacts may be introduced at color discontinuities. For this purpose a non-Euclidean color opponent transformation has been analysed and used to separate the standard RGB color space into uncorrelated components.
A common approach to achieve adaptive image filtering is to select an edge-stopping function from a set of functions that have proven to work well in the past. The purpose of the edge-stopping function is to inhibit smoothing of image features that are desired to be retained, such as lines, edges or other application-dependent characteristics. Thus, a step is taken from ad-hoc filtering based on experience towards application-driven filtering, such that only desired image features are processed. This improves what is characterised as visually relevant features, a topic which this thesis covers, in particular for medical imaging.
The notion of what constitutes relevant features is subjective, and a layman's opinion may differ from a professional's. Therefore, we advocate that any image filtering method should yield an improvement not only in numerical measures, but also a visual improvement as experienced by the respective end-user.
@phdthesis{diva2:622727,
author = {Åström, Freddie},
title = {{A Variational Approach to Image Diffusion in Non-Linear Domains}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1594}},
year = {2013},
address = {Sweden},
}
Almost all cell-phones and camcorders sold today are equipped with a CMOS (Complementary Metal Oxide Semiconductor) image sensor and there is also a general trend to incorporate CMOS sensors in other types of cameras. The sensor has many advantages over the more conventional CCD (Charge-Coupled Device) sensor such as lower power consumption, cheaper manufacturing and the potential for on-chip processing. Almost all CMOS sensors make use of what is called a rolling shutter. Compared to a global shutter, which images all the pixels at the same time, a rolling-shutter camera exposes the image row-by-row. This leads to geometric distortions in the image when either the camera or the objects in the scene are moving. The recorded videos and images will look wobbly (jello effect), skewed or otherwise strange and this is often not desirable. In addition, many computer vision algorithms assume that the camera used has a global shutter, and will break down if the distortions are too severe.
In airborne remote sensing it is common to use push-broom sensors. These sensors exhibit a similar kind of distortion as a rolling-shutter camera, due to the motion of the aircraft. If the acquired images are to be matched with maps or other images, then the distortions need to be suppressed.
The main contributions in this thesis are the development of three-dimensional models for rolling-shutter distortion correction. Previous attempts modelled the distortions as taking place in the image plane, and we have shown that our techniques give better results for hand-held camera motions.
The basic idea is to estimate the camera motion, not only between frames, but also the motion during frame capture. The motion can be estimated using inter-frame image correspondences and with these a non-linear optimisation problem can be formulated and solved. All rows in the rolling-shutter image are imaged at different times, and when the motion is known, each row can be transformed to the rectified position.
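A toy illustration of the per-row resampling idea, assuming a purely horizontal, constant camera motion and a hypothetical row_shift_last parameter (the thesis estimates full 3D camera motion; this sketch only shows how each row can be resampled to a common time instant):

import numpy as np

def rectify_rows(frame, row_shift_last):
    # row_shift_last: how far (pixels, to the right) the scene content drifts
    # between the exposure of the first and the last row of one frame
    rows, cols = frame.shape
    out = np.empty_like(frame, dtype=float)
    xs = np.arange(cols)
    for r in range(rows):
        drift = row_shift_last * r / (rows - 1)       # drift at this row's capture time
        out[r] = np.interp(xs + drift, xs, frame[r])  # resample the row back to the first row's time
    return out

frame = np.tile(np.sin(np.linspace(0, 8 * np.pi, 256)), (128, 1))  # hypothetical distorted frame
rectified = rectify_rows(frame, row_shift_last=5.0)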
In addition to rolling-shutter distortions, hand-held footage often has shaky camera motion. It has been shown how to do efficient video stabilisation, in combination with the rectification, using rotation smoothing.
In the thesis it has been explored how to use similar techniques as for the rolling-shutter case in order to correct push-broom images, and also how to rectify 3D point clouds from e.g. the Kinect depth sensor.
@phdthesis{diva2:526675,
author = {Ringaby, Erik},
title = {{Geometric Computer Vision for Rolling-shutter and Push-broom Sensors}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1535}},
year = {2012},
address = {Sweden},
}
Digital camera equipped cell phones were introduced in Japan in 2001; they quickly became popular and by 2003 outsold the entire stand-alone digital camera market. In 2010 sales passed one billion units and the market is still growing. Another trend is the rising popularity of smartphones, which has led to a rapid development of the processing power on a phone, and many units sold today bear close resemblance to a personal computer. The combination of a powerful processor and a camera which is easily carried in your pocket opens up a large field of interesting computer vision applications.
The core contribution of this thesis is the development of methods that allow an imaging device such as the cell phone camera to estimate its own motion and to capture the observed scene structure. One of the main focuses of this thesis is real-time performance, where the real-time constraint not only results in shorter processing times, but also allows for user interaction.
In computer vision, structure from motion refers to the process of estimating camera motion and 3D structure by exploring the motion in the image plane caused by the moving camera. This thesis presents several methods for estimating camera motion. Given the assumption that a set of images has known camera poses associated with them, we train a system to solve for the camera pose of a new image very fast. For the cases where no a priori information is available, a fast minimal case solver is developed. The solver uses five points in two camera views to estimate the cameras' relative position and orientation. This type of minimal case solver is usually used within a RANSAC framework. In order to increase accuracy and performance, a refinement to the random sampling strategy of RANSAC is proposed. It is shown that the new scheme doubles the performance for the five-point solver used on video data. For larger systems of cameras, a new Bundle Adjustment method is developed which is able to handle video from cell phones.
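A sketch of how such a five-point minimal solver is typically wrapped in RANSAC, here using OpenCV's findEssentialMat and recoverPose on synthetic correspondences (the refined sampling strategy and Bundle Adjustment of the thesis are not included; the camera intrinsics and motion below are made up):

import numpy as np
import cv2

# Synthetic correspondences: random 3D points seen by two calibrated cameras
rng = np.random.default_rng(0)
K = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])
X = rng.uniform([-1., -1., 4.], [1., 1., 8.], size=(200, 3))
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))
t_true = np.array([[0.1], [0.0], [0.02]])

def project(points, R, t):
    x = (K @ (R @ points.T + t)).T
    return x[:, :2] / x[:, 2:]

pts1 = project(X, np.eye(3), np.zeros((3, 1)))
pts2 = project(X, R_true, t_true)

# Five-point minimal solver inside RANSAC, then decomposition into relative pose
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)   # t is recovered only up to scale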
Demands for reduction in size, power consumption and price have led to a redesign of the image sensor. As a consequence the sensors have changed from a global shutter to a rolling shutter, where a rolling-shutter image is acquired row by row. Classical structure from motion methods are modeled on the assumption of a global shutter, and a rolling shutter can severely degrade their performance. One of the main contributions of this thesis is a new Bundle Adjustment method for cameras with a rolling shutter. The method accurately models the camera motion during image exposure with an interpolation scheme for both position and orientation.
The developed methods are not restricted to cellphones only, but are rather applicable to any type of mobile platform that is equipped with cameras, such as an autonomous car or a robot. The domestic robot comes in many flavors, everything from vacuum cleaners to service and pet robots. A robot equipped with a camera that is capable of estimating its own motion while sensing its environment, like the human eye, can provide an effective means of navigation for the robot. Many of the presented methods are well suited for robots, where low latency and real-time constraints are crucial in order to allow them to interact with their environment.
@phdthesis{diva2:517601,
author = {Hedborg, Johan},
title = {{Motion and Structure Estimation From Video}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1449}},
year = {2012},
address = {Sweden},
}
Traffic accidents are globally the number one cause of death for people 15-29 years old and are among the top three causes for all age groups 5-44 years. Much of the work within this thesis has been carried out in projects aiming for (cognitive) driver assistance systems and hopefully represents a step towards improving traffic safety.
The main contributions are within the area of Computer Vision, and more specifically, within the areas of shape matching, Bayesian tracking, and visual servoing with the main focus being on shape matching and applications thereof. The different methods have been demonstrated in traffic safety applications, such as bicycle tracking, car tracking, and traffic sign recognition, as well as for pose estimation and robot control.
One of the core contributions is a new method for recognizing closed contours, based on complex correlation of Fourier descriptors. It is shown that keeping the phase of Fourier descriptors is important. Neglecting the phase can result in perfect matches between intrinsically different shapes. Another benefit of keeping the phase is that rotation covariant or invariant matching is achieved in the same way. The only difference is to either consider the magnitude, for rotation invariant matching, or just the real value, for rotation covariant matching, of the complex valued correlation.
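A small sketch of phase-preserving Fourier-descriptor matching by complex correlation, in the spirit of the description above (the normalisation details here are illustrative, not the exact scheme of the thesis):

import numpy as np

def fourier_descriptors(contour):
    # contour: (N, 2) array of boundary points; represent as complex z = x + iy
    z = contour[:, 0] + 1j * contour[:, 1]
    return np.fft.fft(z)

def fd_match(fd_a, fd_b, rotation_invariant=True):
    # Zero the DC component so the match is translation invariant
    a, b = fd_a.copy(), fd_b.copy()
    a[0] = 0.0
    b[0] = 0.0
    # Correlation over all starting-point (index) shifts via the correlation theorem
    corr = np.fft.ifft(a * np.conj(b)) * len(a)
    # A contour rotation only multiplies corr by a unit complex factor:
    # magnitude -> rotation invariant score, real part -> rotation covariant score
    score = np.abs(corr) if rotation_invariant else np.real(corr)
    return score.max() / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy usage: an ellipse matched against a rotated copy with a different start point
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ellipse = np.c_[2 * np.cos(t), np.sin(t)]
rot = np.pi / 5
R = np.array([[np.cos(rot), -np.sin(rot)], [np.sin(rot), np.cos(rot)]])
rotated = np.roll(ellipse @ R.T, 11, axis=0)
print(fd_match(fourier_descriptors(ellipse), fourier_descriptors(rotated)))  # close to 1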
The shape matching method has further been used in combination with an implicit star-shaped object model for traffic sign recognition. The presented method works fully automatically on query images with no need for regions-of-interests. It is shown that the presented method performs well for traffic signs that contain multiple distinct contours, while some improvement still is needed for signs defined by a single contour. The presented methodology is general enough to be used for arbitrary objects, as long as they can be defined by a number of regions.
Another contribution has been the extension of a framework for learning based Bayesian tracking called channel based tracking. Compared to earlier work, the multi-dimensional case has been reformulated in a sound probabilistic way and the learning algorithm itself has been extended. The framework is evaluated in car tracking scenarios and is shown to give competitive tracking performance, compared to standard approaches, but with the advantage of being fully learnable.
The last contribution has been in the field of (cognitive) robot control. The presented method achieves sufficient accuracy for simple assembly tasks by combining autonomous recognition with visual servoing, based on a learned mapping between percepts and actions. The method demonstrates that limitations of inexpensive hardware, such as web cameras and low-cost robotic arms, can be overcome using powerful algorithms.
All in all, the methods developed and presented in this thesis can all be used for different components in a system guided by visual information, and hopefully represents a step towards improving traffic safety.
@phdthesis{diva2:452207,
author = {Larsson, Fredrik},
title = {{Shape Based Recognition -- Cognitive Vision Systems in Traffic Safety Applications}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1395}},
year = {2011},
address = {Sweden},
}
Autonomous navigation for ground vehicles has many challenges. Autonomous systems must be able to self-localise, avoid obstacles and determine navigable surfaces. This thesis studies several aspects of autonomous navigation with a particular emphasis on vision, motivated by it being a primary component for navigation in many high-level biological organisms. The key problem of self-localisation or pose estimation can be solved through analysis of the changes in appearance of rigid objects observed from different view points. We therefore describe a system for structure and motion estimation for real-time navigation and obstacle avoidance. With the explicit assumption of a calibrated camera, we have studied several schemes for increasing accuracy and speed of the estimation.
The basis of most structure and motion pose estimation algorithms is a good point tracker. However, point tracking is computationally expensive and can occupy a large portion of the CPU resources. In this thesis we show how a point tracker can be implemented efficiently on the graphics processor, which results in faster tracking of points and the CPU being available to carry out additional processing tasks.
In addition we propose a novel view interpolation approach that can be used effectively for pose estimation given previously seen views. In this way, a vehicle will be able to estimate its location by interpolating previously seen data.
Navigation and obstacle avoidance may be carried out efficiently using structure and motion, but only within a limited range from the camera. In order to increase this effective range, additional information needs to be incorporated, more specifically the location of objects in the image. For this, we propose a real-time object recognition method which uses P-channel matching, and which may be used for improving navigation accuracy at distances where structure estimation is unreliable.
@phdthesis{diva2:345040,
author = {Hedborg, Johan},
title = {{Pose Estimation and Structure Analysis of Image Sequences}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1418}},
year = {2009},
address = {Sweden},
}
This thesis deals with three topics; Bayesian tracking, shape matching and visual servoing. These topics are bound together by the goal of visual control of robotic systems. The work leading to this thesis was conducted within two European projects, COSPAL and DIPLECS, both with the stated goal of developing artificial cognitive systems. Thus, the ultimate goal of my research is to contribute to the development of artificial cognitive systems.
The contribution to the field of Bayesian tracking is in the form of a framework called Channel Based Tracking (CBT). CBT has been proven to perform competitively with particle filter based approaches but with the added advantage of not having to specify the observation or system models. CBT uses channel representation and correspondence free learning in order to acquire the observation and system models from unordered sets of observations and states. We demonstrate how this has been used for tracking cars in the presence of clutter and noise.
The shape matching part of this thesis presents a new way to match Fourier Descriptors (FDs). We show that it is possible to take rotation and index shift into account while matching FDs without explicitly de-rotating the contours or neglecting the phase. We also propose to use FDs for matching locally extracted shapes, in contrast to the traditional way of using FDs to match the global outline of an object. We have in this context evaluated our matching scheme against the popular Affine Invariant FDs and shown that our method is clearly superior.
In the visual servoing part we present a visual servoing method that is based on an action precedes perception approach. By applying random actions to a system, e.g. a robotic arm, it is possible to learn a mapping between action space and percept space. In experiments we show that it is possible to achieve high precision positioning of a robotic arm without knowing beforehand what the robotic arm looks like or how it is controlled.
@phdthesis{diva2:278320,
author = {Larsson, Fredrik},
title = {{Methods for Visually Guided Robotic Systems:
Matching, Tracking and Servoing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1416}},
year = {2009},
address = {Sweden},
}
State-of-the-art reconstruction algorithms for medical helical cone-beam Computed Tomography (CT) are of type non-exact Filtered Backprojection (FBP). They are attractive because of their simplicity and low computational cost, but they produce sub-optimal images with respect to artifacts, resolution, and noise. This thesis deals with possibilities to improve the image quality by means of iterative techniques.
The first algorithm, Regularized Iterative Weighted Filtered Backprojection (RIWFBP), is an iterative algorithm employing the non-exact Weighted Filtered Backprojection (WFBP) algorithm [Stierstorfer et al., Phys. Med. Biol. 49, 2209-2218, 2004] in the update step. We have measured and compared artifact reduction as well as resolution and noise properties for RIWFBP and WFBP. The results show that artifacts originating in the non-exactness of the WFBP algorithm are suppressed within five iterations without notable degradation in terms of resolution versus noise. Our experiments also indicate that the number of required iterations can be reduced by employing a technique known as ordered subsets.
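A 2D parallel-beam analogue of such an iterative update, with plain FBP (scikit-image's radon/iradon) standing in for the helical cone-beam WFBP operator, and with weighting, step length and regularisation omitted:

import numpy as np
from skimage.transform import radon, iradon

theta = np.linspace(0., 180., 180, endpoint=False)
phantom = np.zeros((128, 128))
phantom[40:90, 40:90] = 1.0                      # toy object
sino = radon(phantom, theta=theta)               # "measured" projections

x = iradon(sino, theta=theta)                    # initial (non-exact) FBP reconstruction
for _ in range(5):
    residual = sino - radon(x, theta=theta)      # mismatch between data and reprojected image
    x = x + iradon(residual, theta=theta)        # update: FBP applied to the residual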
A small modification of RIWFBP leads to a new algorithm, the Weighted Least Squares Iterative Filtered Backprojection (WLS-IFBP). This algorithm has a slightly lower rate of convergence than RIWFBP, but in return it has the attractive property of converging to a solution of a certain least squares minimization problem. Hereby, theory and algorithms from optimization theory become applicable.
Besides linear regularization, we have examined edge-preserving non-linear regularization. In this case, resolution becomes contrast dependent, a fact that can be utilized for improving high contrast resolution without degrading the signal-to-noise ratio in low contrast regions. Resolution measurements at different contrast levels and anthropomorphic phantom studies confirm this property. Furthermore, an even more pronounced suppression of artifacts is observed.
Iterative reconstruction opens for more realistic modeling of the input data acquisition process than what is possible with FBP. We have examined the possibility to improve the forward projection model by (i) multiple ray models, and (ii) calculating strip integrals instead of line integrals. In both cases, for linear regularization, the experiments indicate a trade off: the resolution is improved at the price of increased noise levels. With non-linear regularization on the other hand, the degraded signal-to-noise ratio in low contrast regions can be avoided.
Huge input data sizes make experiments on real medical CT data very demanding. To alleviate this problem, we have implemented the most time consuming parts of the algorithms on a Graphics Processing Unit (GPU). These implementations are described in some detail, and some specific problems regarding parallelism and memory access are discussed.
@phdthesis{diva2:232734,
author = {Sunnegårdh, Johan},
title = {{Iterative Filtered Backprojection Methods for Helical Cone-Beam CT}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1264}},
year = {2009},
address = {Sweden},
}
This thesis is about channel-coded feature maps applied in view-based object recognition, tracking, and machine learning. A channel-coded feature map is a soft histogram of joint spatial pixel positions and image feature values. Typical useful features include local orientation and color. Using these features, each channel measures the co-occurrence of a certain orientation and color at a certain position in an image or image patch. Channel-coded feature maps can be seen as a generalization of the SIFT descriptor with the options of including more features and replacing the linear interpolation between bins by a more general basis function.
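A sketch of such a channel-coded feature map for a grey-scale patch, using linear interpolation between bins (the SIFT-like special case mentioned above); the patch, bin counts and hard orientation binning are simplifications:

import numpy as np

def soft_bins(values, n_bins, lo, hi):
    # Linear interpolation ('triangular basis') between neighbouring bins
    pos = (values - lo) / (hi - lo) * (n_bins - 1)
    i0 = np.clip(np.floor(pos).astype(int), 0, n_bins - 2)
    w1 = pos - i0
    return i0, 1.0 - w1, w1

def channel_coded_map(patch, n_spatial=4, n_orient=8):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    rows, cols = patch.shape
    r, c = np.mgrid[0:rows, 0:cols]

    # Soft histogram over (row, column, orientation); orientation is hard-binned for brevity
    H = np.zeros((n_spatial, n_spatial, n_orient))
    ri, rw0, rw1 = soft_bins(r.ravel(), n_spatial, 0, rows - 1)
    ci, cw0, cw1 = soft_bins(c.ravel(), n_spatial, 0, cols - 1)
    oi = (ang.ravel() / (2 * np.pi) * n_orient).astype(int) % n_orient
    m = mag.ravel()
    for dr, rw in ((0, rw0), (1, rw1)):
        for dc, cw in ((0, cw0), (1, cw1)):
            np.add.at(H, (ri + dr, ci + dc, oi), m * rw * cw)
    return H

patch = np.random.rand(32, 32)
print(channel_coded_map(patch).shape)   # (4, 4, 8)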
The general idea of channel coding originates from a model of how information might be represented in the human brain. For example, different neurons tend to be sensitive to different orientations of local structures in the visual input. The sensitivity profiles tend to be smooth such that one neuron is maximally activated by a certain orientation, with a gradually decaying activity as the input is rotated.
This thesis extends previous work on using channel-coding ideas within computer vision and machine learning. By differentiating the channel-coded feature maps with respect to transformations of the underlying image, a method for image registration and tracking is constructed. By using piecewise polynomial basis functions, the channel coding can be computed more efficiently, and a general encoding method for N-dimensional feature spaces is presented.
Furthermore, I argue for using channel-coded feature maps in view-based pose estimation, where a continuous pose parameter is estimated from a query image given a number of training views with known pose. The optimization of position, rotation and scale of the object in the image plane is then included in the optimization problem, leading to a simultaneous tracking and pose estimation algorithm. Apart from objects and poses, the thesis examines the use of channel coding in connection with Bayesian networks. The goal here is to avoid the hard discretizations usually required when Markov random fields are used on intrinsically continuous signals like depth for stereo vision or color values in image restoration.
Channel coding has previously been used to design machine learning algorithms that are robust to outliers, ambiguities, and discontinuities in the training data. This is obtained by finding a linear mapping between channel-coded input and output values. This thesis extends this method with an incremental version and identifies and analyzes a key feature of the method -- that it is able to handle a learning situation where the correspondence structure between the input and output space is not completely known. In contrast to a traditional supervised learning setting, the training examples are groups of unordered input-output points, where the correspondence structure within each group is unknown. This behavior is studied theoretically and the effect of outliers and convergence properties are analyzed.
All presented methods have been evaluated experimentally. The work has been conducted within the cognitive systems research project COSPAL funded by EC FP6, and much of the contents has been put to use in the final COSPAL demonstrator system.
@phdthesis{diva2:17496,
author = {Jonsson, Erik},
title = {{Channel-Coded Feature Maps for Computer Vision and Machine Learning}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 1160}},
year = {2008},
address = {Sweden},
}
Contemporary algorithms employed for reconstruction of 3D volumes from helical cone beam projections are so called non-exact algorithms. This means that the reconstructed volumes contain artifacts irrespective of the detector resolution and number of projection angles employed in the process. In this thesis, three iterative schemes for suppression of these so called cone artifacts are investigated.
The first scheme, iterative weighted filtered backprojection (IWFBP), is based on iterative application of a non-exact algorithm. For this method, artifact reduction, as well as spatial resolution and noise properties are measured. During the first five iterations, cone artifacts are clearly reduced. As a side effect, spatial resolution and noise are increased. To avoid this side effect and improve the convergence properties, a regularization procedure is proposed and evaluated.
In order to reduce the cost of the IWFBP scheme, a second scheme is created by combining IWFBP with the so called ordered subsets technique, which we call OSIWFBP. This method divides the projection data set into subsets, and operates sequentially on each of these in a certain order, hence the name “ordered subsets”. We investigate two different ordering schemes and number of subsets, as well as the possibility to accelerate cone artifact suppression. The main conclusion is that the ordered subsets technique indeed reduces the number of iterations needed, but that it suffers from the drawback of noise amplification.
The third scheme starts by dividing input data into high- and low-frequency data, followed by non-iterative reconstruction of the high-frequency part and IWFBP reconstruction of the low-frequency part. This could open for acceleration by reduction of data in the iterative part. The results show that a suppression of artifacts similar to that of the IWFBP method can be obtained, even if a significant part of high-frequency data is non-iteratively reconstructed.
@phdthesis{diva2:23125,
author = {Sunnegårdh, Johan},
title = {{Combining analytical and iterative reconstruction in helical cone-beam CT}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1301}},
year = {2007},
address = {Sweden},
}
The possibility to use real-time computer vision in video sequences gives many opportunities for a system to interact with the environment. Possible ways for interaction are e.g. augmented reality like in the MATRIS project where the purpose is to add new objects into the video sequence, or surveillance where the purpose is to find abnormal events.
The increase in computer speed over the last years has simplified this process, and it is now possible to use at least some of the more advanced computer vision algorithms that are available. The computational speed of computers is however still a problem; for an efficient real-time system, efficient code and methods are necessary. This thesis deals with both problems: one part is about efficient implementations using single instruction multiple data (SIMD) instructions and one part is about robust tracking.
An efficient real-time system requires efficient implementations of the used computer vision methods. Efficient implementations require knowledge about the CPU and the possibilities it offers. In this thesis, one method called SIMD is explained. SIMD is useful when the same operation is applied to multiple data, which is usually the case in computer vision where the same operation is executed on each pixel.
Following the position of a feature or object in a video sequence is called tracking. Tracking can be used for a number of applications. The application in this thesis is to use tracking for pose estimation. One way to do tracking is to cut out a small region around the feature, creating a patch, and to find the position of this patch in the other frames. To find the position, a measure of the difference between the patch and the image at a given position is used. This thesis thoroughly investigates the sum of absolute difference (SAD) error measure. The investigation involves different ways to improve the robustness and to decrease the average error. One method to estimate the average error, the covariance of the position error, is proposed. An estimate of the average error is needed when different measurements are combined.
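A minimal sketch of SAD-based patch tracking as described above (exhaustive search over a small window; the robustness improvements and error-covariance estimation of the thesis are omitted, and the patch location and search radius are illustrative):

import numpy as np

def sad_track(prev, curr, top_left, size=15, search=8):
    # Patch around the feature in the previous frame
    r, c = top_left
    patch = prev[r:r + size, c:c + size].astype(float)
    best, best_d = np.inf, (0, 0)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            rr, cc = r + dr, c + dc
            if rr < 0 or cc < 0 or rr + size > curr.shape[0] or cc + size > curr.shape[1]:
                continue                                    # candidate outside the image
            cand = curr[rr:rr + size, cc:cc + size].astype(float)
            sad = np.abs(patch - cand).sum()                # sum of absolute differences
            if sad < best:
                best, best_d = sad, (dr, dc)
    return best_d, best

prev = np.random.rand(120, 160)
curr = np.roll(prev, (2, -3), axis=(0, 1))                  # hypothetical motion: 2 px down, 3 px left
print(sad_track(prev, curr, (50, 70)))                      # ((2, -3), ~0)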
Finally, a system for camera pose estimation is presented. The computer vision part of this system is based on the results in this thesis. The presentation also contains a discussion about the results of this system.
@phdthesis{diva2:22906,
author = {Skoglund, Johan},
title = {{Robust Real-Time Estimation of Region Displacements in Video Sequences}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1296}},
year = {2007},
address = {Sweden},
}
This thesis presents work done within the EC-funded project VISATEC. Due to the different directions of the VISATEC project this thesis has a few different threads.
A novel representation scheme for medium level vision features is presented and applied to range sensor data and to image sequences. Some estimation procedures for this representation have been implemented and tested. The representation is tensor based and uses higher order tensors in a projective space. The tensor can hold information on several local structures including their relative position and orientation. This information can also be extracted from the tensor.
A number of well-known techniques are combined in a novel way to be able to perform object pose estimation under changes of the object in position, scale and rotation from a single 2D image. The local feature used is a patch which is resampled in a log-polar pattern. A number of local features are matched to a database and the k nearest neighbors vote on object state parameters. The most probable object states are found through mean-shift clustering.
A system using multi-cue integration as a means of reaching a higher level of system-level robustness and a higher level of accuracy is developed and evaluated in an industrial-like setting. The system is based around a robotic manipulator arm with an attached camera. The system is designed to solve parts of the bin-picking problem. The above-mentioned 2D technique for object pose estimation is also evaluated within this system.
@phdthesis{diva2:21426,
author = {Viksten, Fredrik},
title = {{Methods for vision-based robotic automation}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1161}},
year = {2005},
address = {Sweden},
}
This thesis presents methods useful in a bin picking application, such as detection and representation of local features, pose estimation and multi-cue integration.
The scene tensor is a representation of multiple line or edge segments and was first introduced by Nordberg in [30]. A method for estimating scene tensors from gray-scale images is presented. The method is based on orientation tensors, where the scene tensor can be estimated by correlations of the elements in the orientation tensor with a number of 1D filters. Mechanisms for analyzing the scene tensor are described and an algorithm for detecting interest points and estimating feature parameters is presented. It is shown that the algorithm works on a wide spectrum of images with good results.
Representations that are invariant with respect to a set of transformations are useful in many applications, such as pose estimation, tracking and wide baseline stereo. The scene tensor itself is not invariant and three different methods for implementing an invariant representation based on the scene tensor are presented. One is based on a non-linear transformation of the scene tensor and is invariant to perspective transformations. Two versions of a tensor doublet are presented, which are based on the geometry of two interest points and are invariant to translation, rotation and scaling. The tensor doublet is used in a framework for view centered pose estimation of 3D objects. It is shown that the pose estimation algorithm has good performance even though the object is occluded and has a different scale compared to the training situation.
An industrial implementation of a bin picking application has to cope with several different types of objects. All pose estimation algorithms use some kind of model and there is yet no model that can cope with all kinds of situations and objects. This thesis presents a method for integrating cues from several pose estimation algorithms for increasing the system stability. It is also shown that the same framework can be used for increasing the accuracy of the system by using cues from several views of the object. An extensive test with several different objects, lighting conditions and backgrounds shows that multi-cue integration makes the system more robust and increases the accuracy.
Finally, a system for bin picking is presented, built from the previous parts of this thesis. An eye-in-hand setup is used with a standard industrial robot arm. It is shown that the system works for real bin-picking situations with a positioning error below 1 mm and an orientation error below 1 degree for most of the different situations.
@phdthesis{diva2:21387,
author = {Söderberg, Robert},
title = {{Compact Representations and Multi-cue Integration for Robotics}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 1160}},
year = {2005},
address = {Sweden},
}
This thesis introduces and explores a new type of representation for low and medium level vision operations called channel representation. The channel representation is a more general way to represent information than e.g. as numerical values, since it allows incorporation of uncertainty, and simultaneous representation of several hypotheses. More importantly it also allows the representation of “no information” when no statement can be given. A channel representation of a scalar value is a vector of channel values, which are generated by passing the original scalar value through a set of kernel functions. The resultant representation is sparse and monopolar. The word sparse signifies that information is not necessarily present in all channels. On the contrary, most channel values will be zero. The word monopolar signifies that all channel values have the same sign, e.g. they are either positive or zero. A zero channel value denotes “no information”, and for non-zero values, the magnitude signifies the relevance.
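A small sketch of channel encoding and local mode decoding of a scalar, using the cos^2 kernels commonly used with unit channel spacing (the kernel choice and decoding formula are standard examples, not necessarily the exact variants analysed in the thesis):

import numpy as np

OMEGA = np.pi / 3          # kernel width parameter for unit channel spacing

def encode(x, n_channels):
    # Pass the scalar through a set of overlapping cos^2 kernel functions on the integers:
    # the result is sparse (at most three non-zero channels) and monopolar (all values >= 0)
    k = np.arange(n_channels)
    d = x - k
    c = np.cos(OMEGA * d) ** 2
    c[np.abs(d) >= 1.5] = 0.0
    return c

def decode(c):
    # Local mode decoding from the strongest channel and its two neighbours
    k = int(np.clip(np.argmax(c), 1, len(c) - 2))
    z = c[k - 1] * np.exp(-2j * OMEGA) + c[k] + c[k + 1] * np.exp(2j * OMEGA)
    return k + np.angle(z) / (2 * OMEGA)

c = encode(3.3, 8)
print(c)            # non-zero only in channels 2, 3 and 4
print(decode(c))    # ~3.3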
In the thesis, a framework for channel encoding and local decoding of scalar values is presented. Averaging in the channel representation is identified as a regularised sampling of a probability density function. A subsequent decoding is thus a mode estimation technique.
The mode estimation property of channel averaging is exploited in the channel smoothing technique for image noise removal. We introduce an improvement to channel smoothing, called alpha synthesis, which deals with the problem of jagged edges present in the original method. Channel smoothing with alpha synthesis is compared to mean-shift filtering, bilateral filtering, median filtering, and normalized averaging with favourable results.
A fast and robust blob-feature extraction method for vector fields is developed. The method is also extended to cluster constant slopes instead of constant regions. The method is intended for view-based object recognition and wide baseline matching. It is demonstrated on a wide baseline matching problem.
A sparse scale-space representation of lines and edges is implemented and described. The representation keeps line and edge statements separate, and ensures that they are localised by inhibition from coarser scales. The result is however still locally continuous, in contrast to non-max-suppression approaches, which introduce a binary threshold.
The channel representation is well suited to learning, which is demonstrated by applying it in an associative network. An analysis of representational properties of associative networks using the channel representation is made.
Finally, a reactive system design using the channel representation is proposed. The system is similar in idea to recursive Bayesian techniques using particle filters, but the present formulation allows learning using the associative networks.
@phdthesis{diva2:244318,
author = {Forss\'{e}n, Per-Erik},
title = {{Low and Medium Level Vision Using Channel Representations}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 858}},
year = {2004},
address = {Sweden},
}
This thesis presents some concepts and methods for low level computer vision and learning, with object recognition as the primary application.
An efficient method for detection of local rotational symmetries in images is presented. Rotational symmetries include circle patterns, star patterns, and certain high curvature patterns. The method for detection of these patterns is based on local moments computed on a local orientation description in double angle representation, which makes the detection invariant to the sign of the local direction vectors. Some methods are also suggested to increase the selectivity of the detection method. The symmetries can serve as feature descriptors and interest points for use in hierarchical matching structures for object recognition and related problems.
A view-based method for 3D object recognition and estimation of object pose from a single image is also presented. The method is based on simple feature vector matching and clustering. Local orientation regions computed at interest points are used as features for matching. The regions are computed such that they are invariant to translation, rotation, and locally invariant to scale. Each match casts a vote on a certain object pose, rotation, scale, and position, and a joint estimate is found by a clustering procedure. The method is demonstrated on a number of real images and the region features are compared with the SIFT descriptor, which is another standard region feature for the same application.
Finally, a new associative network is presented which applies the channel representation for both input and output data. This representation is sparse and monopolar, and is a simple yet powerful representation of scalars and vectors. It is especially suited for representation of several values simultaneously, a property that is inherited by the network and something which is useful in many computer vision problems. The chosen representation enables us to use a simple linear model for non-linear mappings. The linear model parameters are found by solving a least squares problem with a non-negative constraint, which gives a sparse regularized solution.
@phdthesis{diva2:244321,
author = {Johansson, Björn},
title = {{Low Level Operations and Learning in Computer Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 912}},
year = {2004},
address = {Sweden},
}
Three-dimensional (3D) volume data has become increasingly common with the emergence and wide availability of modern 3D image acquisition techniques. The demand for computerized analysis and visualization techniques is constantly growing to utilize the abundant information embedded in these data.
This thesis consists of three parts. The first part presents methods of analyzing 3D volume data by using second derivatives. Harmonic functions are used to combine the non-orthogonal second derivative operators into an orthogonal basis. Three basic features, magnitude, shape, and orientation, are extracted from the second derivative responses after diagonalizing the Hessian matrix. Two applications on magnetic resonance angiography (MRA) data are presented. One of them utilizes a scale-space and the second order variation to enhance the vascular system by discriminating for string structures. The other one employs the local shape information to detect cases of stenosis.
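A 2D sketch of this kind of second-derivative shape analysis, computing Hessian eigenvalues at a chosen scale and forming a Frangi-style "stringness" measure (the thesis works in 3D with a harmonic-function operator basis; the constants below are illustrative):

import numpy as np
from skimage.feature import hessian_matrix, hessian_matrix_eigvals

def stringness(image, sigma=2.0, beta=0.5):
    # Eigenvalues of the Gaussian-smoothed Hessian, largest first
    H = hessian_matrix(image, sigma=sigma, order='rc')
    l1, l2 = hessian_matrix_eigvals(H)
    ratio = np.abs(l1) / (np.abs(l2) + 1e-12)        # 'shape' feature: small on string-like structures
    mag = np.hypot(l1, l2)                           # 'magnitude' feature: second-derivative energy
    c = 0.5 * mag.max() + 1e-12
    v = np.exp(-(ratio ** 2) / (2 * beta ** 2)) * (1 - np.exp(-(mag ** 2) / (2 * c ** 2)))
    v[l2 > 0] = 0                                    # keep only bright ridges (negative principal curvature)
    return v

image = np.zeros((64, 64))
image[30:33, :] = 1.0                                # a bright horizontal string structure
print(stringness(image).max())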
The second part of the thesis discusses some modifications of the fast marching method in 2D and 3D space. By shifting the input and output grids relative to each other, we show that the sampled cost functions are used in a more consistent way. We present new algorithms for anisotropic fast marching which incorporate orientation information during the marching process. Three applications illustrate the usage of the fast marching methods. The first one extracts a guide wire as a minimum-cost path on a salience distance map of a line detection result of a fluoroscopy image. The second application extracts the vascular tree from a whole body MRA volume. In the third application, a 3D guide wire is reconstructed from a pair of biplane images using the minimum-cost path formulation.
The third part of the thesis proposes a new frame-coherent volume rendering algorithm. It is an extension of the algorithm by Gudmundsson and Randén (1990). The new algorithm is capable of efficiently generating rotation sequences around an arbitrary axis. Essentially, it enables the ray-casting procedure to quickly approach the hull of the object using the so called shadow-lines recorded from the previous frame.
@phdthesis{diva2:302939,
author = {Lin, Qingfen},
title = {{Enhancement, Extraction, and Visualization of 3D Volume Data}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 824}},
year = {2003},
address = {Sweden},
}
This thesis introduces a new signal transform, called polynomial expansion, and based on this develops novel methods for estimation of orientation and motion. The methods are designed exclusively in the spatial domain and can be used for signals of any dimensionality.
Two important concepts in the use of the spatial domain for signal processing are projections into subspaces, e.g. the subspace of second degree polynomials, and representations by frames, e.g. wavelets. It is shown how these concepts can be unified in a least squares framework for representation of finite dimensional vectors by bases, frames, subspace bases, and subspace frames.
This framework is used to give a new derivation of normalized convolution, a method for signal analysis that takes uncertainty in signal values into account and also allows for spatial localization of the analysis functions.
Polynomial expansion is a transformation which at each point transforms the signal into a set of expansion coefficients with respect to a polynomial local signal model. The expansion coefficients are computed using normalized convolution. As a consequence polynomial expansion inherits the mechanism for handling uncertain signals and the spatial localization feature allows good control of the properties of the transform. It is shown how polynomial expansion can be computed very efficiently.
As an application of polynomial expansion, a novel method for estimation of orientation tensors is developed. A new concept for orientation representation, orientation functionals, is introduced and it is shown that orientation tensors can be considered a special case of this representation. By evaluation on a test sequence it is demonstrated that the method performs excellently.
Considering an image sequence as a spatiotemporal volume, velocity can be estimated from the orientations present in the volume. Two novel methods for velocity estimation are presented, with the common idea to combine the orientation tensors over some region for estimation of the velocity field according to a parametric motion model, e.g. affine motion. The first method involves a simultaneous segmentation and velocity estimation algorithm to obtain appropriate regions. The second method is designed for computational efficiency and uses local neighborhoods instead of trying to obtain regions with coherent motion. By evaluation on the Yosemite sequence, it is shown that both methods give substantially more accurate results than previously published methods.
Another application of polynomial expansion is a novel displacement estimation algorithm, i.e. an algorithm which estimates motion from only two consecutive frames rather than from a whole spatiotemporal volume. This approach is necessary when the motion is not temporally coherent, e.g. because the camera is affected by vibrations. It is shown how moving objects can robustly be detected in such image sequences by using the plane+parallax approach to separate out the background motion.
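OpenCV ships a dense two-frame estimator based on polynomial expansion (cv2.calcOpticalFlowFarneback), which can serve as a sketch of this kind of displacement estimation; the frames and parameters below are illustrative and the plane+parallax separation is not shown:

import numpy as np
import cv2

prev = (np.random.rand(240, 320) * 255).astype(np.uint8)   # hypothetical frame at time t
curr = np.roll(prev, 3, axis=1)                             # frame at t+1: content shifted 3 px right

# args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
dx, dy = flow[..., 0], flow[..., 1]                          # per-pixel displacement field
print(dx.mean())                                             # roughly 3 for this toy input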
To demonstrate the power of being able to handle uncertain signals it is shown how normalized convolution and polynomial expansion can be computed for interlaced video signals. Together with the displacement estimation algorithm this gives a method to estimate motion from a single interlaced frame.
@phdthesis{diva2:302485,
author = {Farnebäck, Gunnar},
title = {{Polynomial expansion for orientation and motion estimation}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 790}},
year = {2002},
address = {Sweden},
}
This thesis presents a new method for detection of complex curvatures such as corners, circles, and star patterns. The method is based on a second degree local polynomial model applied to a local orientation description in double angle representation. The theory of rotational symmetries is used to compute curvature responses from the parameters of the polynomial model. The responses are made more selective using a scheme of inhibition between different symmetry models. These symmetries can serve as feature points at a high abstraction level for use in hierarchical matching structures for 3D estimation, object recognition, image database search, etc.
A very efficient approximative algorithm for single and multiscale polynomial expansion is developed, which is used for detection of the complex curvatures in one or several scales. The algorithm is based on the simple observation that polynomial functions multiplied with a Gaussian function can be described in terms of partial derivatives of the Gaussian. The approximative polynomial expansion algorithm is evaluated in an experiment to estimate local orientation on 3D data, and the performance is comparable to previously tested algorithms which are more computationally expensive.
The curvature algorithm is demonstrated on natural images and in an object recognition experiment. Phase histograms based on the curvature features are developed and shown to be useful as an alternative compact image representation.
The importance of curvature is furthermore motivated by reviewing examples from biological and perceptual studies. The usefulness of local orientation information to detect curvature is also motivated by an experiment about learning a corner detector.
@phdthesis{diva2:312510,
author = {Johansson, Björn},
title = {{Multiscale Curvature Detection in Computer Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 877}},
year = {2001},
address = {Sweden},
}
The art of medical computed tomography is constantly evolving and the last years have seen new ground-breaking systems with multi-row detectors. These tomographs are able to increase both scanning speed and image quality compared to the single-row systems more commonly found in hospitals today. This thesis deals with three-dimensional image reconstruction algorithms to be used in future generations of tomographs with even more detector rows than found in current multi-row systems.
The first practical algorithm for three-dimensional reconstruction from cone-beam projections acquired from a circular source trajectory is the FDK method. We present a novel version of this algorithm that produces images of higher quality. We also formulate a version of the FDK method that performs the backprojection in O(N^3 log N) steps instead of the O(N^4) steps traditionally required.
An efficient way to acquire volumetric patient data is to use a helical source trajectory together with a multi-row detector. We present an overview of existing reconstruction algorithms for this geometry. We also present a new family of algorithms, the PI methods, which seem to surpass other proposals in simplicity while delivering images of high quality.
The detector used in the PI methods is limited to a window that exactly fits the cylindrical section between two consecutive turns of the helical source path. A rebinning to oblique parallel beams yields a geometry with many attractive properties. The key property behind the simplicity of the PI methods is that each object point to be reconstructed is illuminated by the source during a rotation of exactly half a turn. This allows for fast and simple reconstruction.
@phdthesis{diva2:302800,
author = {Turbell, Henrik},
title = {{Cone-Beam Reconstruction Using Filtered Backprojection}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 672}},
year = {2001},
address = {Sweden},
}
In this thesis a new type of representation for medium level vision operations is explored. We focus on representations that are sparse and monopolar. The word sparse signifies that information in the feature sets used is not necessarily present at all points. On the contrary, most features will be inactive. The word monopolar signifies that all features have the same sign, e.g. are either positive or zero. A zero feature value denotes ``no information'', and for non-zero values, the magnitude signifies the relevance.
A sparse scale-space representation of local image structure (lines and edges) is developed.
A method known as the channel representation is used to generate sparse representations, and its ability to deal with multiple hypotheses is described. It is also shown how these hypotheses can be extracted in a robust manner.
The connection of soft histograms (i.e. histograms with overlapping bins) to the channel representation, as well as to the use of dithering in relaxation of quantisation errors is shown. The use of soft histograms for estimation of unknown probability density functions (PDF), and estimation of image rotation are demonstrated.
The advantage with the use of sparse, monopolar representations in associative learning is demonstrated.
Finally we show how sparse, monopolar representations can be used to speed up and improve template matching.
@phdthesis{diva2:288615,
author = {Forss\'{e}n, Per-Erik},
title = {{Sparse Representations for Medium Level Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 869}},
year = {2001},
address = {Sweden},
}
Three-dimensional (3D) volume data has become increasingly common with the emergence and wide availability of modern 3D image acquisition techniques. The demand for computerized analysis and visualization techniques is constantly growing to utilize the abundant information embedded in these data.
This thesis consists of two parts. The first part presents methods of analyzing 3D volume data by using second derivatives. Harmonic functions are used to combine the non-orthogonal second derivative operators into an orthogonal basis. Three basic features, magnitude, shape, and orientation, are extracted from the second derivative responses after diagonalizing the Hessian matrix. Two applications on magnetic resonance angiography (MRA) data are presented. One of them utilizes a scale-space and the second order variation to enhance the vascular system by discriminating for string structures. The other one employs the local shape information to detect cases of stenosis.
The second part of the thesis proposes a new frame-coherent volume rendering algorithm. It is an extension of the algorithm by Gudmundsson and Randén (1990). The new algorithm is capable of efficiently generating rotation sequences around an arbitrary axis. Essentially, it enables the ray-casting procedure to quickly approach the hull of the object using the so called shadow-lines recorded from the previous frame.
@phdthesis{diva2:288322,
author = {Lin, Qingfen},
title = {{Enhancement, Detection, and Visualization of 3D Volume Data}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 903}},
year = {2001},
address = {Sweden},
}
In this licentiate thesis, we discuss how to generate actions from percepts within an autonomous robotic system. In particular, we discuss and propose an original reactive architecture suitable for response generation, learning and self-organization.
The architecture uses incremental learning and supports self organization through distributed dynamic model generation and self-contained components. Signals to and from the architecture are represented using the channel representation, which is presented in that context.
The components of the architecture use a novel and flexible implementation of an artificial neural network. The learning rules for this implementation are derived.
A simulator is presented. It has been designed and implemented in order to test and evaluate the proposed architecture.
Results of a series of experiments on the reactive architecture are discussed and accounted for. The experiments have been performed within three different scenarios, using the developed simulator.
The problem of information representation in robotic architectures is illustrated by a problem of anchoring symbols to visual data. This is presented in the context of the WITAS project.
@phdthesis{diva2:288283,
author = {Andersson, Thord},
title = {{Learning in a Reactive Robotic Architecture}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 817}},
year = {2000},
address = {Sweden},
}
This thesis presents a number of methods to estimate 3D structures with a single translating camera. The camera is assumed to be calibrated and to have a known translation and rotation.
Applications for aircraft altitude estimation and ground structure estimation ahead of the aircraft are discussed. The idea is to mount a camera on the aircraft and use the motion estimates obtained in the inertial navigation system. One reason for this arrangement is to make the aircraft more passive, in comparison to conventional radar based altitude estimation.
Two groups of methods are considered, optical flow based and region tracking based. Both groups have advantages and drawbacks.
Two methods to estimate the optical flow are presented. The accuracy of the estimated ground structure is increased by varying the temporal distance between the frames used in the optical flow estimation algorithms.
Four region tracking algorithms are presented. Two of them use canonical correlation and the other two are based on sum of squared difference and complex correlation respectively.
The depth estimates are then temporally filtered using weighted least squares or a Kalman filter.
A simple estimation of the computational complexity and memory requirements for the algorithms is presented to aid estimation of the hardware requirements.
Tests on real flight sequences are performed, showing that the aircraft altitude can be estimated with a good accuracy.
@phdthesis{diva2:288278,
author = {Moe, Anders},
title = {{Passive Aircraft Altitude Estimation using Computer Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 847}},
year = {2000},
address = {Sweden},
}
This thesis presents a framework for estimation of motion fields both for single and multiple layers. All the methods have in common that they generate or use constraints on the local motion. Motion constraints are represented by vectors whose directions describe one component of the local motion and whose magnitude indicates confidence.
Two novel methods for estimating these motion constraints are presented. Both methods take two images as input and apply orientation sensitive quadrature filters. One method is similar to a gradient method applied on the phase from the complex filter outputs. The other method is based on novel results using canonical correlation presented in this thesis.
Parametric models, e.g. affine or FEM, are used to estimate motion from constraints on local motion. In order to estimate smooth fields for models with many parameters, cost functions on deformations are introduced.
Motions of transparent multiple layers are estimated by implicit or explicit clustering of motion constraints into groups. General issues and difficulties in analysis of multiple motions are described. An extension of the known EM algorithm is presented together with experimental results on multiple transparent layers with affine motions. Good accuracy in estimation allows reconstruction of layers using a backprojection algorithm. As an alternative to the EM algorithm, this thesis also introduces a method based on higher order tensors.
A result with potential applications in a number of different research fields is the extension of canonical correlation to handle complex variables. Correlation is maximized using a novel method that can handle singular covariance matrices.
@phdthesis{diva2:302892,
author = {Hemmendorff, Magnus},
title = {{Single and Multiple Motion Field Estimation}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 764}},
year = {1999},
address = {Sweden},
}
In this thesis, novel methods for estimation of orientation and velocity are presented. The methods are designed exclusively in the spatial domain.
Two important concepts in the use of the spatial domain for signal processing are projections into subspaces, e.g. the subspace of second degree polynomials, and representations by frames, e.g. wavelets. It is shown how these concepts can be unified in a least squares framework for representation of finite dimensional vectors by bases, frames, subspace bases, and subspace frames.
This framework is used to give a new derivation of Normalized Convolution, a method for signal analysis that takes uncertainty in signal values into account and also allows for spatial localization of the analysis functions.
With the help of Normalized Convolution, a novel method for orientation estimation is developed. The method is based on projection onto second degree polynomials and the estimates are represented by orientation tensors. A new concept for orientation representation, orientation functionals, is introduced and it is shown that orientation tensors can be considered a special case of this representation. A very efficient implementation of the estimation method is presented and by evaluation on a test sequence it is demonstrated that the method performs excellently.
Considering an image sequence as a spatiotemporal volume, velocity can be estimated from the orientations present in the volume. Two novel methods for velocity estimation are presented, with the common idea to combine the orientation tensors over some region for estimation of the velocity field according to a motion model, e.g. affine motion. The first method involves a simultaneous segmentation and velocity estimation algorithm to obtain appropriate regions. The second method is designed for computational efficiency and uses local neighborhoods instead of trying to obtain regions with coherent motion. By evaluation on the Yosemite sequence, it is shown that both methods give substantially more accurate results than previously published methods.
@phdthesis{diva2:302473,
author = {Farnebäck, Gunnar},
title = {{Spatial domain methods for orientation and velocity estimation}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 755}},
year = {1999},
address = {Sweden},
}
The subject of this dissertation is to show how learning can be used for multidimensional signal processing, in particular computer vision. Learning is a wide concept, but it can generally be defined as a system’s change of behaviour in order to improve its performance in some sense.
Learning systems can be divided into three classes: supervised learning, reinforcement learning and unsupervised learning. Supervised learning requires a set of training data with correct answers and can be seen as a kind of function approximation. A reinforcement learning system does not require a set of answers. It learns by maximizing a scalar feedback signal indicating the system’s performance. Unsupervised learning can be seen as a way of finding a good representation of the input signals according to a given criterion.
In learning and signal processing, the choice of signal representation is a central issue. For high-dimensional signals, dimensionality reduction is often necessary. It is then important not to discard useful information. For this reason, learning methods based on maximizing mutual information are particularly interesting.
A properly chosen data representation allows local linear models to be used in learning systems. Such models have the advantage of having a small number of parameters and can for this reason be estimated by using relatively few samples. An interesting method that can be used to estimate local linear models is canonical correlation analysis (CCA). CCA is strongly related to mutual information. The relation between CCA and three other linear methods is discussed. These methods are principal component analysis (PCA), partial least squares (PLS) and multivariate linear regression (MLR). An iterative method for CCA, PCA, PLS and MLR, in particular low-rank versions of these methods, is presented.
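A toy example of canonical correlation analysis on two signal sets that share a hidden source, using scikit-learn's CCA (the iterative low-rank algorithms discussed in the thesis are not shown; the data below is synthetic):

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                            # shared hidden signal
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)                                  # canonical variates
print([np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(2)])  # both close to 1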
A novel method for learning filters for multidimensional signal processing using CCA is presented. By showing the system signals in pairs, the filters can be adapted to detect certain features and to be invariant to others. A new method for local orientation estimation has been developed using this principle. This method is significantly less sensitive to noise than previously used methods.
Finally, a novel stereo algorithm is presented. This algorithm uses CCA and phase analysis to detect the disparity in stereo images. The algorithm adapts filters in each local neighbourhood of the image in a way which maximizes the correlation between the filtered images. The adapted filters are then analysed to find the disparity. This is done by a simple phase analysis of the scalar product of the filters. The algorithm can even handle cases where the images have different scales. The algorithm can also handle depth discontinuities and give multiple depth estimates for semi-transparent images.
@phdthesis{diva2:302872,
author = {Borga, Magnus},
title = {{Learning Multidimensional Signal Processing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 531}},
year = {1998},
address = {Sweden},
}
The thesis describes novel methods for image motion computation and template matching.
A multiscale algorithm for energy-based estimation and representation of local spatiotemporal structure by second order symmetric tensors is presented. An efficient spatiotemporal implementation of a signal modelling method called normalized convolution is described. This provides a means to handle signals with varying degree of reliability.
As an application of the above results, a smooth pursuit motion tracking algorithm that uses observations of both target motion and position for camera head control and motion prediction is described. The target is detected using a novel motion field segmentation algorithm which assumes that the motion fields of the target and its immediate vicinity, at least occasionally, each can be modelled by a single parameterized motion model. A method to eliminate camera-induced background motion in the case of a pan/tilt rotating camera is suggested.
In a second application, a high-precision image motion estimation algorithm performing clustering in motion parameter space is developed. The algorithm, which can handle multiple motions by simultaneous motion parameter estimation and image segmentation, iteratively maximizes the posterior probability of the motion parameter set given the observed local spatiotemporal structure tensor field. The probabilistic formulation provides a natural way to incorporate additional prior information about the segmentation of the scene into the objective function. A simple homotopy continuation method (embedding algorithm) is used to increase the likelihood of convergence to a near-optimal solution.
The final part of the thesis is concerned with tracking of (partially) occluded targets. An algorithm for target tracking in head-up display sequences is presented. The method generalizes cross-correlation coefficient matching by introducing a signal confidence-based distance metric. To handle target shape changes, a method for template mask shape-adaptation based on geometric transformation parameter optimisation is introduced. The presence of occluding objects makes local structure descriptors (e.g., the gradient) unreliable, which means that only pixelwise comparisons of target and template can be made, unless the local structure operators are modified to take into account the varying signal certainty. Normalized convolution provides the means for such a modification. This is demonstrated in a section on phase-based target tracking, which also contains a presentation of a generic method for tracking of occluded targets by combining normalized convolution with iterative reweighting.
@phdthesis{diva2:302807,
author = {Karlholm, Jörgen},
title = {{Local Signal Models for Image Sequence Analysis}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 536}},
year = {1998},
address = {Sweden},
}
Reinforcement learning is a general and powerful way to formulate complex learning problems and acquire good system behaviour. The goal of a reinforcement learning system is to maximize a long term sum of instantaneous rewards provided by a teacher. In its most extreme form, reinforcement learning only requires that the teacher can provide a measure of success. This formulation does not require a training set with correct responses, and allows the system to become better than its teacher.
In reinforcement learning much of the burden is moved from the teacher to the training algorithm. The exact and general algorithms that exist for these problems are based on dynamic programming (DP), and have a computational complexity that grows exponentially with the dimensionality of the state space. These algorithms can only be applied to real world problems if an efficient encoding of the state space can be found.
To cope with these problems, heuristic algorithms and function approximation need to be incorporated. In this thesis it is argued that local models have the potential to help solve problems in high-dimensional spaces, whereas global models do not. This is motivated by the bias-variance dilemma, which is resolved with the assumption that the system is constrained to live on a low-dimensional manifold in the space of inputs and outputs. This observation leads to the introduction of bias in terms of continuity and locality.
A linear approximation of the system dynamics and a quadratic function describing the long term reward are suggested to constitute a suitable local model. For problems involving one such model, i.e. linear quadratic regulation problems, novel convergence proofs for heuristic DP algorithms are presented. This is one of few available convergence proofs for reinforcement learning in continuous state spaces.
Reinforcement learning is closely related to optimal control, where local models are commonly used. Relations to present methods are investigated, e.g. adaptive control, gain scheduling, fuzzy control, and jump linear systems. Ideas from these areas are compiled in a synergistic way to produce a new algorithm for heuristic dynamic programming where function parameters and locality, expressed as model applicability, are learned on-line. Both top-down and bottom-up versions are presented.
The emerging local models and their applicability need to be memorized by the learning system. The binary tree is put forward as a suitable data structure for on-line storage and retrieval of these functions.
@phdthesis{diva2:302961,
author = {Landelius, Tomas},
title = {{Reinforcement Learning and Distributed Local Model Synthesis}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 469}},
year = {1997},
address = {Sweden},
}
The thesis describes novel methods for efficient spatiotemporal filtering and modeling. A multiresolution algorithm for energy-based estimation and representation of local spatiotemporal structure by second order symmetric tensors is presented. The problem of how to properly process estimates with varying degree of reliability is addressed. An efficient spatiotemporal implementation of a certainty-based signal modeling method called normalized convolution is described. As an application of the above results, a smooth pursuit motion tracking algorithm that uses observations of both target motion and position for camera head control and motion prediction is described. The target is detected using a novel motion field segmentation algorithm which assumes that the motion fields of the target and its immediate vicinity, at least occasionally, each can be modeled by a single parameterized motion model. A method to eliminate camera-induced background motion in the case of a pan/tilt rotating camera is suggested.
@phdthesis{diva2:288261,
author = {Karlholm, Jörgen},
title = {{Efficient Spatiotemporal Filtering and Modelling}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 562}},
year = {1996},
address = {Sweden},
}
This thesis deals with focus of attention control in active vision systems. A framework for hierarchical gaze control in a robot vision system is presented, and an implementation for a simulated robot is described. The robot is equipped with a heterogeneously sampled imaging system, a fovea, resembling the spatially varying resolution of a human retina. The relation between foveas and multiresolution image processing as well as implications for image operations are discussed.
A stereo algorithm based on local phase differences is presented both as a stand alone algorithm and as a part of a robot vergence control system. The algorithm is fast and can handle large disparities while maintaining subpixel accuracy. The method produces robust and accurate estimates of displacement on synthetic as well as real life stereo images. Disparity filter design is discussed and a number of filters are tested, e.g. Gabor filters and lognorm quadrature filters. A design method for disparity filters having precisely one phase cycle is also presented.
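A minimal single-scale sketch of the phase-difference idea is given below; the Gabor filter, its parameters and the toy scanlines are placeholders, and the actual algorithm additionally relies on multiple scales and other filter designs.

import numpy as np

def gabor_response(signal, omega, sigma=4.0):
    # Complex (quadrature) Gabor filter response along a 1D scanline
    x = np.arange(-int(4 * sigma), int(4 * sigma) + 1)
    g = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * omega * x)
    return np.convolve(signal, g, mode='same')

# Toy stereo pair: the right scanline is the left one shifted by 3 pixels
rng = np.random.default_rng(1)
left = rng.normal(size=256)
right = np.roll(left, 3)

omega = np.pi / 8                      # centre frequency of the filter (assumed)
qL = gabor_response(left, omega)
qR = gabor_response(right, omega)

# Local phase difference divided by the centre frequency gives a disparity estimate
disparity = np.angle(qL * np.conj(qR)) / omega
print("median disparity estimate:", np.median(disparity))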
A theory for sequentially defined data modified focus of attention is presented. The theory is applied to a preattentive gaze control system consisting of three cooperating control strategies. The first is an object finder that uses circular symmetries as indications of possible objects and directs the fixation point accordingly. The second is an edge tracker that makes the fixation point follow structures in the scene. The third is a camera vergence control system which assures that both eyes are fixating on the same point. The coordination between the strategies is handled using potential fields in the robot parameter space.
Finally, a new focus of attention method for disregarding filter responses from already modelled structures is presented. The method is based on a filtering method, normalized convolution, originally developed for filtering incomplete and uncertain data. By setting the certainty of the input data to zero in areas of known or predicted signals, a purposive removal of operator responses can be obtained. On succeeding levels, image features from these areas become 'invisible' and consequently do not attract the attention of the system. This technique also allows the system to effectively explore new events. By cancelling known, or modeled, signals the attention of the system is shifted to new events not yet described.
@phdthesis{diva2:302463,
author = {Westelius, Carl-Johan},
title = {{Focus of attention and gaze control for robot vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 379}},
year = {1995},
address = {Sweden},
}
In this thesis, the theory of reinforcement learning is described and its relation to learning in biological systems is discussed. Some basic issues in reinforcement learning, the credit assignment problem and perceptual aliasing, are considered. The methods of temporal difference are described. Three important design issues are discussed: information representation and system architecture, rules for improving the behaviour and rules for the reward mechanisms. The use of local adaptive models in reinforcement learning is suggested and exemplified by some experiments. This idea is behind all the work presented in this thesis. A method for learning to predict the reward called the prediction matrix memory is presented. This structure is similar to the correlation matrix memory but differs in that it is not only able to generate responses to given stimuli but also to predict the rewards in reinforcement learning. The prediction matrix memory uses the channel representation, which is also described. A dynamic binary tree structure that uses the prediction matrix memories as local adaptive models is presented. The theory of canonical correlation is described and its relation to the generalized eigenproblem is discussed. It is argued that the directions of canonical correlations can be used as linear models in the input and output spaces respectively in order to represent input and output signals that are maximally correlated. It is also argued that this is a better representation in a response generating system than, for example, principal component analysis since the energy of the signals has nothing to do with their importance for the response generation. An iterative method for finding the canonical correlations is presented. Finally, the possibility of using the canonical correlation for response generation in a reinforcement learning system is indicated.
@phdthesis{diva2:288543,
author = {Borga, Magnus},
title = {{Reinforcement Learning Using Local Adaptive Models}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 507}},
year = {1995},
address = {Sweden},
}
This thesis presents a signal representation in terms of operators. The signal is assumed to be an element of a vector space and subject to transformations of operators. The operators form continuous groups, so-called Lie groups. The representation can be used for signals in general, in particular if spatial relations are undefined, and it does not require a basis of the signal space to be useful.
Special attention is given to orthogonal operator groups which are generated by anti-Hermitian operators by means of the exponential mapping. It is shown that the eigensystem of the group generator is strongly related to properties of the corresponding operator group. For one-parameter orthogonal operator groups, a phase concept is introduced. This phase can for instance be used to distinguish between spatially even and odd signals and, therefore, corresponds to the usual phase for multi-dimensional signals.
Given one operator group that represents the variation of the signal and one operator group that represents the variation of a corresponding feature descriptor, an equivariant mapping maps the signal to the descriptor such that the two operator groups correspond. Sufficient conditions are derived for a general mapping to be equivariant with respect to a pair of operator groups. These conditions are expressed in terms of the generators of the two operator groups. As a special case, second order homogeneous mappings are considered, and examples of how second order mappings can be used to obtain different types of feature descriptors are presented, in particular for operator groups that are homomorphic to rotations in two and three dimensions, respectively. A generalization of directed quadrature filters is made. All feature extraction algorithms that are presented are discussed in terms of phase invariance.
Simple procedures that estimate group generators which correspond to one-parameter groups are derived and tested on an example. The resulting generator is evaluated by using its eigensystem in implementations of two feature extraction algorithms. It is shown that the resulting feature descriptor has good accuracy with respect to the corresponding feature value, even in the presence of signal noise.
@phdthesis{diva2:302847,
author = {Nordberg, Klas},
title = {{Signal Representation and Processing using Operator Groups}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 366}},
year = {1994},
address = {Sweden},
}
This thesis deals with filtering of multidimensional signals. A large part of the thesis is devoted to a novel filtering method termed "Normalized convolution". The method performs local expansion of a signal in a chosen filter basis which not necessarily has to be orthonormal. A key feature of the method is that it can deal with uncertain data when additional certainty statements are available for the data and/or the filters. It is shown how false operator responses due to missing or uncertain data can be significantly reduced or eliminated using this technique. Perhaps the most well-known of such effects are the various 'edge effects' which invariably occur at the edges of the input data set. The method is an example of the signal/certainty philosophy, i.e. the separation of both data and operator into a signal part and a certainty part. An estimate of the certainty must accompany the data. Missing data are simply handled by setting the certainty to zero. Localization or windowing of operators is done using an applicability function, the operator equivalent to certainty, not by changing the actual operator coefficients. Spatially or temporally limited operators are handled by setting the applicability function to zero outside the window.
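The signal/certainty idea can be illustrated for a single local neighbourhood as a weighted least-squares projection onto a small basis, with weights given by applicability times certainty; the basis, window and data values below are placeholders, not the thesis implementation.

import numpy as np

x = np.arange(-3, 4)                                   # local coordinates
f = np.array([2.0, 2.5, 3.0, 0.0, 4.0, 4.5, 5.0])      # signal, one sample missing
c = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])      # certainty (0 = missing)
a = np.exp(-x**2 / 4.0)                                # applicability (localization)

# Basis: constant and linear functions (need not be orthonormal)
B = np.stack([np.ones_like(x), x], axis=1).astype(float)

# Normalized convolution at this point: weighted least squares with W = diag(a * c)
W = np.diag(a * c)
coeffs = np.linalg.solve(B.T @ W @ B, B.T @ W @ f)
print("local mean and slope estimates:", coeffs)       # the missing sample is ignored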
The use of tensors in estimation of local structure and orientation using spatiotemporal quadrature filters is reviewed and related to dual tensor bases. The tensor representation conveys the degree and type of local anisotropy. For image sequences, the shape of the tensors describes the local structure of the spatiotemporal neighbourhood and provides information about local velocity. The tensor representation also conveys information for deciding if true flow or only normal flow is present. It is shown how normal flow estimates can be combined into a true flow using averaging of this tensor field description.
Important aspects of representation and techniques for grouping local orientation estimates into global line information are discussed. The uniformity of some standard parameter spaces for line segmentation is investigated. The analysis shows that, to avoid discontinuities, great care should be taken when choosing the parameter space for a particular problem. A new parameter mapping well suited for line extraction, the Möbius strip parameterization, is defined. The method has similarities to the Hough Transform.
Estimation of local frequency and bandwidth is also discussed. Local frequency is an important concept which provides an indication of the appropriate range of scales for subsequent analysis. One-dimensional and two-dimensional examples of local frequency estimation are given. The local bandwidth estimate is used for defining a certainty measure. The certainty measure enables the use of a normalized averaging process increasing robustness and accuracy of the frequency statements.
@phdthesis{diva2:302457,
author = {Westin, Carl-Fredrik},
title = {{A Tensor Framework for Multidimensional Signal Processing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 348}},
year = {1994},
address = {Sweden},
}
The work presented in this thesis is based on the basic idea of learning by reinforcement, within the theory of behaviorism. The reason for this choice is the generality of such an approach, especially that the reinforcement learning paradigm allows systems to be designed which can improve their behavior beyond that of their teacher. The role of the teacher is to define the reinforcement function, which acts as a description of the problem the machine is to solve.
Learning is considered to be a bootstrapping procedure. Fragmented past experience, of what to do when performing well, is used for response generation. The new response, in its turn, adds more information to the system about the environment. Gained knowledge is represented by a behavior probability density function. This density function is approximated with a number of normal distributions which are stored in the nodes of a binary tree. The tree structure is grown by applying a recursive algorithm to the stored stimuli-response combinations, called decisions. By considering both the response and the stimulus, the system is able to bring meaning to structures in the input signal. The recursive algorithm is first applied to the whole set of stored decisions. A mean decision vector and a covariance matrix are calculated and stored in the root node. The decision space is then partitioned into two halves across the direction of maximal data variation. This procedure is now repeated recursively for each of the two halves of the decision space, forming a binary tree with mean vectors and covariance matrices in its nodes.
The tree is the system's guide to response generation. Given a stimulus, the system searches for responses likely to result in highly reinforced decisions. This is accomplished by treating the sum of the normal distributions in the leaves as distribution describing the behavior of the system. The sum of normal distributions, with the current stimulus held fixed, is finally used for random generation of the response.
This procedure makes it possible for the system to have several equally plausible responses to one stimulus. Not applying maximum likelihood principles will make the system more explorative and reduce its risk of being trapped in local minima.
The performance and complexity of the learning tree is investigated and compared to some well known alternative methods. Presented are also some simple, yet principally important, experiments verifying the behavior of the proposed algorithm.
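For illustration, the sketch below conditions a single stored normal distribution on an observed stimulus and draws a random response; the numbers are placeholders, and the full system instead uses the sum of distributions over all leaves of the tree.

import numpy as np

# Statistics of decisions d = (stimulus s, response r) in one node (placeholder values)
mu = np.array([0.5, 1.0, 2.0])          # [mean of s, mean of r1, mean of r2]
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.5, 0.3],
                  [0.2, 0.3, 0.8]])

s = np.array([0.8])                      # observed stimulus

# Partition into stimulus and response blocks and condition on the stimulus
S_ss, S_sr = Sigma[:1, :1], Sigma[:1, 1:]
S_rs, S_rr = Sigma[1:, :1], Sigma[1:, 1:]
cond_mean = mu[1:] + S_rs @ np.linalg.solve(S_ss, s - mu[:1])
cond_cov = S_rr - S_rs @ np.linalg.solve(S_ss, S_sr)

# Random response generation from the conditioned distribution
response = np.random.default_rng(0).multivariate_normal(cond_mean, cond_cov)
print("sampled response:", response)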
@phdthesis{diva2:288301,
author = {Landelius, Tomas},
title = {{Behavior Representation by Growing a Learning Tree}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 397}},
year = {1993},
address = {Sweden},
}
A framework for a hierarchical approach to gaze control of a robot vision system is presented, and an implementation on a simulated robot is described. The robot is equipped with a heterogeneously sampled imaging system, a fovea, resembling the varying resolution in a human retina. The relation between the fovea and multiresolution image processing is discussed together with implications for image operations.
A stereo algorithm based on local phase differences is presented both as a stand alone algorithm and as a part of a vergence control system for the robot above. The algorithm is fast and can handle large disparities and still give subpixel accuracy. The algorithm uses a wavelet approach beginning at a coarse resolution and refining the disparity estimates while increasing the resolution. The method produces robust and accurate estimates of displacement on synthetic as well as real life stereo images. Disparity filter design is discussed and a number of filters are tested, e.g. Gabor filters and lognorm quadrature filters. A special disparity filter designed to have only one phase cycle is also presented.
A theory for sequentially defined, data modified focus of attention, based on nested regions of interest is presented. The theory is applied to a preattentive gaze control system consisting of three control levels: camera vergence, edge tracking and object finding. The object finder uses circular symmetries as indications of possible objects and directs the fixation point accordingly. The edge tracker makes the fixation point follow the structures in the scene and the camera vergence control assures that both eyes are fixating on the same point. The coordination between the levels is handled with potential fields in the robot parameter space.
@phdthesis{diva2:311047,
author = {Westelius, Carl-Johan},
title = {{Preattentive gaze control for robot vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 322}},
year = {1992},
address = {Sweden},
}
This thesis concerns robust estimation of low-level features for use in computer vision systems. The presentation consists of two parts.
The first part deals with controllable filters and models. A basis filter set is introduced which supports a computationally efficient synthesis of filters in arbitrary orientations. In contrast to many earlier methods, this approach allows the use of more complex models at an early stage of the processing. A new algorithm for robust estimation of orientation is presented. The algorithm is based on synthesized quadrature responses and supports the simultaneous representation and individual averaging of multiple events. These models are then extended to include estimation and representation of more complex image primitives such as line ends, T-junctions, crossing lines and curvature. The proposed models are based on symmetry properties in the Fourier domain as well as in the spatial plane and the feature extraction is performed by applying the original basis filters directly on the grey-level image. The basis filters and interpolation scheme are finally generalized to allow synthesis of 3-D filters. The performance of the proposed models and algorithms is demonstrated using test images of both synthetic and real world data.
The second part of the thesis concerns an image feature representation adapted for a robust analogue implementation. A possible use for this approach is in analogue VLSI or corresponding analogue hardware adapted for neural networks. The methods are based on projections of quadrature filter responses and mutual inhibition of magnitude signals.
@phdthesis{diva2:302868,
author = {Andersson, Mats T.},
title = {{Controllable Multi-dimensional Filters and Models in Low-Level Computer Vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 282}},
year = {1992},
address = {Sweden},
}
Feature extraction from a tensor based local image representation introduced by Knutsson in [37] is discussed. The tensor representation keeps statements of structure, certainty of statement and energy separate. Further processing for obtaining new features also having these three entities separate is achieved by the use of a new concept, tensor field filtering. Tensor filters for smoothing and for extraction of circular symmetries are presented and discussed in particular. These methods are used for corner detection and extraction of more global features such as lines in images. A novel method for grouping local orientation estimates into global line parameters is introduced. The method is based on a new parameter space, the Möbius Strip parameter space, which has similarities to the Hough transform. A local centroid clustering algorithm is used for classification in this space. The procedure automatically divides curves into line segments with appropriate lengths depending on the curvature. A linked list structure is built up for storing data in an efficient way.
@phdthesis{diva2:311051,
author = {Westin, Carl-Fredrik},
title = {{Feature extraction based on a tensor image description}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 288}},
year = {1991},
address = {Sweden},
}
This thesis concerns the estimation and description of curvature for computer vision applications. Different types of multi-dimensional data are considered: images (2D); volumes (3D); time sequences of images (3D); and time sequences of volumes (4D).
The methods are based on local Fourier domain models and use local operations such as filtering. A hierarchical approach is used. Firstly, the local orientation is estimated and represented with a vector field equivalent description. Secondly, the local curvature is estimated from the orientation description. The curvature algorithms are closely related to the orientation estimation algorithms and the methods as a whole give a unified approach to the estimation and description of orientation and curvature. In addition, the methodology avoids thresholding and premature decision making.
Results on both synthetic and real world data are presented to illustrate the algorithms' performance with respect to accuracy and noise insensitivity. Examples illustrating the use of the curvature estimates for tasks such as image enhancement are also included.
@phdthesis{diva2:311049,
author = {Bårman, Håkan},
title = {{Hierarchical curvature estimation in computer vision}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 253}},
year = {1991},
address = {Sweden},
}
This thesis contains a presentation and an analysis of adaptive filtering strategies for multidimensional data. The size, shape and orientation of the filter are signal controlled and thus adapted locally to each neighbourhood according to a predefined model. The filter is constructed as a linear weighting of fixed oriented bandpass filters having the same shape but different orientations. The adaptive filtering methods have been tested on both real data and synthesized test data in 2D, e.g. still images, 3D, e.g. image sequences or volumes, with good results. In 4D, e.g. volume sequences, the algorithm is given in its mathematical form. The weighting coefficients are given by the inner products of a tensor representing the local structure of the data and the tensors representing the orientation of the filters.
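A minimal sketch of this weighting step is given below, assuming each fixed filter is represented by the outer product of its direction vector; dual-basis details and the actual filter responses are omitted, and all values are placeholders.

import numpy as np

# Control tensor describing the local structure in one 2D neighbourhood (placeholder)
C = np.array([[0.9, 0.2],
              [0.2, 0.3]])

# Orientations of four fixed bandpass filters (assumed directions)
angles = np.deg2rad([0, 45, 90, 135])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Weight of each filter: inner product between C and the filter orientation tensor
weights = np.array([np.trace(C @ np.outer(n, n)) for n in directions])
print("filter weights:", weights)
# The adaptive output would then be a weighted sum of the fixed filter responses.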
The procedure and filter design in estimating the representation tensor are described. In 2D, the tensor contains information about the local energy, the optimal orientation and a certainty of the orientation. In 3D, the information in the tensor is the energy, the normal to the best fitting local plane and the tangent to the best fitting line, and certainties of these orientations. In the case of time sequences, a quantitative comparison of the proposed method and other (optical flow) algorithms is presented.
The estimation of control information is made in different scales. There are two main reasons for this. A single filter has a particular limited pass band which may or may not be tuned to the different sized objects to describe. Second, size or scale is a descriptive feature in its own right. All of this requires the integration of measurements from different scales. The increasing interest in wavelet theory supports the idea that a multiresolution approach is necessary. Hence the resulting adaptive filter will adapt also in size and to different orientations in different scales.
@phdthesis{diva2:302863,
author = {Haglund, Leif},
title = {{Adaptive Multidimensional Filtering}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 284}},
year = {1991},
address = {Sweden},
}
This paper presents a new method for detection and estimation of curvature. The algorithm is implemented in the hierarchical feature pyramid proposed in the GOP concept. Curvature is handled at the second level of the pyramid with a vector field description of the orientation of the image as input. This complex image is convolved with typically eight filters. The filter responses are combined into a description of curvature direction, curvature magnitude and curvature/linearity ratio. The procedure resembles in many ways the algorithms for the first level of the feature pyramid and seems to be a natural extension of these. The method is easy to implement and the tests made show that it performs well and can handle noisy conditions. Some comparisons with other algorithms have been carried out, and the results indicate that the methodology presented in this paper has a number of important advantages over other methods.
@phdthesis{diva2:288627,
author = {Bårman, Håkan},
title = {{Curvature Estimation and Description}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 167}},
year = {1989},
address = {Sweden},
}
@phdthesis{diva2:288315,
author = {Andersson, Mats},
title = {{Image Feature Representation for Analogue VLSI Representation}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 193}},
year = {1989},
address = {Sweden},
}
Scale analysis and description has over the last years become one of the major research fields in image processing. There are two main reasons for this. A single filter has a particular limited pass band which may or may not be tuned to the different sized objects to describe. Second, size or scale is a descriptive feature in its own right. All of this requires the integration of measurements from different scales.
The thesis describes a new algorithm which detects in what scale an event appears and also in what scale it disappears. In this way the scale space is subdivided into a number of intervals. Within each scale interval a consistency check is performed to get the certainty of the detection. It will be shown that using a three-dimensional phase representation of image data, it is possible to do both the subdivision and the consistency check in a simple manner. The scale levels between different events are detected when a certain dot product becomes negative and the consistency will be a vector summation between these scales. The specific levels where a split of scale space occurs will, of course, be contextually dependent and there will also be different numbers of levels in different parts of the images. Finally an application of size description of this information will be described.
@phdthesis{diva2:288312,
author = {Haglund, Leif},
title = {{Hierarchical Scale Analysis of Images Using Phase Description}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 168}},
year = {1989},
address = {Sweden},
}
The extraction of features is necessary for all aspects of image processing and analysis such as classification, segmentation, enhancement and coding. In the course of developing models to describe images, a need arises for description of more complex structures than lines. This need does not reject the importance of line structures but indicates the need to complement and utilize them in a more systematic way.
In this thesis, some new methods for extraction of local symmetry features as well as experimental results and applications are presented. The local images are expanded in terms of orthogonal functions with iso-value curves being harmonic functions. Circular, linear, hyperbolic and parabolic structures are studied in particular and some two-step algorithms involving only convolutions are given for detection purposes. Confidence measures, with a reliability verified by both theoretical and experimental studies, are proposed. The method is extended to symmetric patterns fulfilling certain general conditions. It is shown that in the general case the resulting algorithms are implementable through the same computing schemes used for detection of linear structures except for a use of different filters.
Multidimensional linear symmetry is studied, and an application problem in 3-D, in particular optical flow, together with the solution proposed by this general framework, is presented. The solution results in a closed form algorithm consisting of two steps, in which spatio-temporal gradient and Gaussian filtering are performed. The result consists of an optical flow estimate minimizing the linear symmetry criterion and a confidence measure based on the minimum error. The frequency band sensitivity of the obtained results is found to be controllable. Experimental results are presented.
@phdthesis{diva2:311053,
author = {Bigün, Josef},
title = {{Local symmetry features in image processing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 179}},
year = {1988},
address = {Sweden},
}
There are a number of different algorithms for object motion estimation in image sequences. Almost every algorithm is based on one of three different mathematical methods. An overview of these methods is given, together with some published application examples on object tracking.
A method for tracking of multiple moving objects has been developed on a GOP 300 image processing system. This method works on image sequences with a stationary background, and can be divided into the following steps:
- Find the positions for all objects that have moved.
- Predict the new positions for all known objects.
- Match these two sets of points.
- Produce the required output.
These steps are repeated for every sample of the sequence. As an output from every processed sample in a test sequence, both a resulting image and a record in a datafile have been generated. The resulting image is a copy of the actual sample with the active object identities overlayed at the corresponding positions. The resulting images have been stored on a video tape.
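For illustration, the prediction/matching step outlined above can be sketched with a modern assignment solver on placeholder coordinates; this is of course not the original GOP 300 implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Predicted positions of known objects and detected positions in the current frame
predicted = np.array([[10.0, 20.0], [40.0, 35.0], [70.0, 15.0]])
detected = np.array([[41.0, 34.0], [9.0, 21.0], [69.5, 16.0]])

# Match the two point sets by minimizing the total Euclidean distance
cost = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
rows, cols = linear_sum_assignment(cost)
for obj_id, det_id in zip(rows, cols):
    print(f"object {obj_id} -> detection {det_id} (distance {cost[obj_id, det_id]:.1f})")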
@phdthesis{diva2:274283,
author = {Wiklund, Johan},
title = {{Image Sequence Analysis for Tracking of Moving Objects}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 107}},
year = {1987},
address = {Sweden},
}
New methods for feature extraction based on the spectral properties of local neighbourhoods are presented. The spectral behaviour of the neighbourhoods is investigated in the spatial domain using the Parseval relation applied to partial derivative pictures. Two types of such properties are considered for circular symmetric and linear symmetric neighbourhoods. These two properties are the existence of point concentration and line concentration in the spectra. For the circular symmetry investigation a new basis function set is introduced. To obtain a spectrum in the terms of these basis function sets, a scalar product is introduced for circular neighbourhoods. The same is carried out for linear symmetry spectra using the well-known basis set consisting of cosines and the 𝓛2(Ω) scalar product. Confidence parameters are introduced to measure the significance of the extracted features. These are basically different types of variance measures and they are shown to be specific for the desired information: the existence of point concentration or line concentration in the spectra of the local neighbourhoods.
@phdthesis{diva2:288297,
author = {Bigun, Josef},
title = {{Circular Symmetry Models in Image Processing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Thesis No. 85}},
year = {1986},
address = {Sweden},
}
Image processing is a broad field posing a wide range of problems. The work presented in this dissertation is mainly concerned with filter design subject to different criteria and constraints.
The first part describes the development of a new radiographic reconstruction method designated Ectomography. The method is novel in that it allows reconstruction of an arbitrarily thick layer of an object using a limited viewing angle.
The subject of the second part is estimation and filtering of local image information. Quadrature filters are designed enabling accurate orientation and frequency estimates. The extracted information is shown to provide a good basis for efficient image enhancement and coding procedures.
@phdthesis{diva2:311054,
author = {Knutsson, Hans},
title = {{Filtering and reconstruction in image processing}},
school = {Linköping University},
type = {{Linköping Studies in Science and Technology. Dissertations No. 88}},
year = {1982},
address = {Sweden},
}
Other
This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on multiple benchmarks, including THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on all datasets, achieving gains as high as 2.3% in terms of mAP at IoU=0.5 on THUMOS14.
@misc{diva2:1600816,
author = {Narayan, Sanath and Cholakkal, Hisham and Hayat, Munawar and Khan, Fahad Shahbaz and Yang, Ming-Hsuan and Shao, Ling},
title = {{D2-Net}},
howpublished = {Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations},
year = {2021},
}
Deep neural networks have achieved remarkable performance on a range of classification tasks, with softmax cross-entropy (CE) loss emerging as the de-facto objective function. The CE loss encourages features of a class to have a higher projection score on the true class-vector compared to the negative classes. However, this is a relative constraint and does not explicitly force different class features to be well-separated. Motivated by the observation that ground-truth class representations in CE loss are orthogonal (one-hot encoded vectors), we develop a novel loss function termed `Orthogonal Projection Loss' (OPL) which imposes orthogonality in the feature space. OPL augments the properties of CE loss and directly enforces inter-class separation alongside intra-class clustering in the feature space through orthogonality constraints on the mini-batch level. As compared to other alternatives of CE, OPL offers unique advantages e.g., no additional learnable parameters, does not require careful negative mining and is not sensitive to the batch size. Given the plug-and-play nature of OPL, we evaluate it on a diverse range of tasks including image recognition (CIFAR-100), large-scale classification (ImageNet), domain generalization (PACS) and few-shot learning (miniImageNet, CIFAR-FS, tiered-ImageNet and Meta-dataset) and demonstrate its effectiveness across the board. Furthermore, OPL offers better robustness against practical nuisances such as adversarial attacks and label noise.
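A rough sketch of the idea as stated above is to penalize deviations from intra-class similarity one and inter-class similarity zero on normalized mini-batch features; the exact formulation and weighting used in the paper may differ, and the toy batch below is a placeholder.

import numpy as np

def orthogonality_loss(features, labels):
    # Normalize features so that dot products become cosine similarities
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    same = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    diff = labels[:, None] != labels[None, :]
    s_same = sim[same].mean()   # intra-class similarity, pushed towards 1
    s_diff = sim[diff].mean()   # inter-class similarity, pushed towards 0
    return (1.0 - s_same) + abs(s_diff)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))          # toy mini-batch of embeddings
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print("orthogonality-style loss:", orthogonality_loss(features, labels))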
@misc{diva2:1600814,
author = {Ranasinghe, Kanchana and Naseer, Muzammal and Hayat, Munawar and Khan, Salman and Khan, Fahad Shahbaz},
title = {{Orthogonal Projection Loss}},
howpublished = {},
year = {2021},
}
Multi-label zero-shot learning (ZSL) is a more realistic counter-part of standard single-label ZSL since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes are large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternate approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9% mAP for ZSL, compared to the best published results.
@misc{diva2:1600810,
author = {Narayan, Sanath and Gupta, Akshita and Khan, Salman and Khan, Fahad Shahbaz and Shao, Ling and Shah, Mubarak},
title = {{Discriminative Region-based Multi-Label Zero-Shot Learning}},
howpublished = {},
year = {2021},
}
We propose a novel transformer-based styled handwritten text image generation approach, HWT, that strives to learn both style-content entanglement as well as global and local writing style patterns. The proposed HWT captures the long and short range relationships within the style examples through a self-attention mechanism, thereby encoding both global and local style patterns. Further, the proposed transformer-based HWT comprises an encoder-decoder attention that enables style-content entanglement by gathering the style representation of each query character. To the best of our knowledge, we are the first to introduce a transformer-based generative network for styled handwritten text generation. Our proposed HWT generates realistic styled handwritten text images and significantly outperforms the state-of-the-art demonstrated through extensive qualitative, quantitative and human-based evaluations. The proposed HWT can handle arbitrary length of text and any desired writing style in a few-shot setting. Further, our HWT generalizes well to the challenging scenario where both words and writing style are unseen during training, generating realistic styled handwritten text images.
@misc{diva2:1600807,
author = {Bhunia, Ankan Kumar and Khan, Salman and Cholakkal, Hisham and Anwer, Rao Muhammad and Khan, Fahad Shahbaz and Shah, Mubarak},
title = {{Handwriting Transformers}},
howpublished = {},
year = {2021},
}
Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.
@misc{diva2:1600805,
author = {Joseph, KJ and Khan, Salman and Khan, Fahad Shahbaz and Balasubramanian, Vineeth N},
title = {{Towards Open World Object Detection}},
howpublished = {},
year = {2021},
}
This book contains material for an introductory course on homogeneous representations for geometry in 2 and 3 dimensions, camera projections, representations of 3D rotations, epipolar geometry, and estimation of various types of geometric objects. Based on these results, a set of applications are presented. It also contains a toolbox of general results that are useful for the presented material. The book is intended for undergraduate studies at advanced level in master programs, or in PhD-courses at introductory level.
@misc{diva2:1136229,
author = {Nordberg, Klas},
title = {{Introduction to Representations and Estimation in Geometry}},
howpublished = {},
year = {2018},
}
@misc{diva2:275356,
author = {Johansson, Björn},
title = {{Rotational Symmetries, a Quick Tutorial}},
howpublished = {},
year = {2001},
}
Reports
The purpose of this document is to reflect on novel and upcoming methods for computer vision that might have relevance for application in robot vision and video analytics. The document covers many different sub-fields of computer vision, most of which have been addressed by our research activity at the computer vision laboratory. The report has been written based on a request of, and supported by, FOI.
@techreport{diva2:1165440,
author = {Felsberg, Michael},
title = {{Five years after the Deep Learning revolution of computer vision:
State of the art methods for online image and video analysis}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2017},
type = {Other academic},
number = {, },
address = {Sweden},
}
The Exponential Linear Unit (ELU) has been proven to speed up learning and improve the classification performance over activation functions such as ReLU and Leaky ReLU for convolutional neural networks. The reasons behind the improved behavior are that ELU reduces the bias shift, it saturates for large negative inputs and it is continuously differentiable. However, it remains open whether ELU has the optimal shape and we address the quest for a superior activation function.
We use a new formulation to tune a piecewise linear activation function during training, to investigate the above question, and learn the shape of the locally optimal activation function. With this tuned activation function, the classification performance is improved and the resulting learned activation function is shown to be ELU-shaped irrespective of whether it is initialized as a ReLU, LReLU or ELU. Interestingly, the learned activation function does not exactly pass through the origin indicating that a shifted ELU-shaped activation function is preferable. This observation leads us to introduce the Shifted Exponential Linear Unit (ShELU) as a new activation function.
Experiments on Cifar-100 show that the classification performance is further improved when using the ShELU activation function in comparison with ELU. The improvement is achieved when learning an individual bias shift for each neuron.
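One plausible reading of the ShELU, sketched below, is an ELU applied to an input shifted by a learnable per-neuron bias; the report's exact parameterization may differ, and the values shown are placeholders.

import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * np.expm1(x))

def shelu(x, shift):
    # Shifted ELU: the nonlinearity's kink no longer has to sit exactly at the origin
    return elu(x + shift)

x = np.array([[-2.0, -0.5, 0.0, 1.5]])       # toy pre-activations for four neurons
shift = np.array([0.1, -0.2, 0.05, 0.0])     # per-neuron shift, learned during training
print(shelu(x, shift))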
@techreport{diva2:1154026,
author = {Grelsson, Bertil and Felsberg, Michael},
title = {{Performance boost in Convolutional Neural Networks by tuning shifted activation functions}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2017},
type = {Other academic},
number = {, },
address = {Sweden},
}
@techreport{diva2:1083263,
author = {Eldesokey, Abdelrahman},
title = {{Normalized Convolutional Neural Networks for Sparse Data}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2017},
type = {Other academic},
number = {LiTH-ISY-R, 3096},
address = {Sweden},
}
A common computer vision task is navigation and mapping. Many indoor navigation tasks require depth knowledge of flat, unstructured surfaces (walls, floor, ceiling). With passive illumination only, this is an ill-posed problem. Inspired by small children using a torchlight, we use a spotlight for active illumination. Using our torchlight approach, depth and orientation estimation of unstructured, flat surfaces boils down to estimation of ellipse parameters. The extraction of ellipses is very robust and requires little computational effort.
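The ellipse estimation itself can be illustrated with a standard least-squares conic fit; the SVD-based sketch below uses synthetic contour points and is not necessarily the fitting procedure used in the report.

import numpy as np

def fit_conic(x, y):
    # General conic a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0,
    # estimated as the null space of the design matrix via SVD
    D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    return np.linalg.svd(D)[2][-1]           # coefficients, defined up to scale

# Synthetic points on an ellipse, standing in for the detected spotlight contour
t = np.linspace(0, 2 * np.pi, 50)
x = 3.0 * np.cos(t) + 0.5
y = 1.5 * np.sin(t) - 0.2
coeffs = fit_conic(x, y)
print("conic coefficients:", coeffs / np.linalg.norm(coeffs))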
@techreport{diva2:650756,
author = {Felsberg, Michael and Larsson, Fredrik and Wang, Han and Ynnerman, Anders and Schön, Thomas},
title = {{Torchlight Navigation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2011},
type = {Other academic},
number = {LiTH-ISY-R, 3004},
address = {Sweden},
}
This report introduces some simplifications to the method by Fitzgibbon et al. that allows for 3D model construction from turn-table sequences. It is assumed that the reader is already familiar with that work in order to fully understand this report.
Fitzgibbon et al. presents a method for 3D model construction that utilizes the extra constraints imposed by turn-table sequences. Restricting the scenario to a turn-table sequence with a single camera with fixed settings produces these extra constraints:
C1. The internal parameters for the camera are the same for all images
C2. The motion of the camera can be described by a rotation around a single axis
It is shown that in the uncalibrated case the number of parameters to estimate is m + 8 where m is the number of images.
We further simplify the problem by using extra constraints given by the fact that we know:
C3. The internal parameters of the camera, i.e the K matrix
C4. That the angle between each pair of consecutive cameras is the same
Using these extra simplifications makes it possible to create a 3D model from realistic data without using Bundle Adjustment.
@techreport{diva2:434353,
author = {Larsson, Fredrik},
title = {{Automatic 3D Model Construction for Turn-Table Sequences - A Simplification}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2011},
type = {Other academic},
number = {LiTH-ISY-R, 3022},
address = {Sweden},
}
We will present the basic theory for the camera geometry. Our goal is camera calibration and the tools necessary for this. We start with homogeneous matrices that can be used to describe geometric transformations in a simple manner. Then we consider the pinhole camera model, the simplified camera model that we will show how to calibrate.
A camera matrix describes the mapping from the 3D world to a camera image. The camera matrix can be determined through a number of corresponding points measured in the world and the image. We also demonstrate the common special case of camera calibration when it can be assumed that the world is flat. Then, a plane in the world is transformed to the image plane. Such a plane-to-plane mapping is called a homography.
Finally, we discuss some useful mathematical tools needed for camera calibration. We show that the solution we present for the determination of the camera matrix is equivalent to a least-squares solution. We also show how to solve a homogeneous system of equations using SVD (singular value decomposition).
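The SVD step can be illustrated as follows, here applied to a homography estimated with the direct linear transform from four placeholder correspondences; this is a generic textbook sketch rather than the report's own derivation.

import numpy as np

def solve_homogeneous(A):
    # Least-squares solution of A h = 0 with ||h|| = 1: the right singular
    # vector belonging to the smallest singular value
    return np.linalg.svd(A)[2][-1]

# Four point correspondences (x, y) <-> (u, v); each gives two DLT rows
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
dst = np.array([[0.1, 0.0], [1.2, 0.1], [1.1, 1.0], [0.0, 0.9]])
rows = []
for (x, y), (u, v) in zip(src, dst):
    rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
    rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])

H = solve_homogeneous(np.array(rows)).reshape(3, 3)
print(H / H[2, 2])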
@techreport{diva2:693117,
author = {Magnusson, Maria},
title = {{Short on camera geometry and camera calibration}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2010},
type = {Other academic},
number = {LiTH-ISY-R, 3070},
address = {Sweden},
}
In this work we present a region detector, an adaptation to range data of the popular Maximally Stable Extremal Regions (MSER) region detector. We call this new detector Maximally Robust Range Regions (MRRR). We apply the new detector to real range data captured by a commercially available laser range camera. Using this data we evaluate the repeatability of the new detector and compare it to some other recently published detectors. The presented detector shows a repeatability which is better or the same as the best of the other detectors. The MRRR detector also offers additional data on the detected regions. The additional data could be crucial in applications such as registration or recognition.
@techreport{diva2:325006,
author = {Viksten, Fredrik and Forss\'{e}n, Per-Erik},
title = {{Maximally Robust Range Regions}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2010},
type = {Other academic},
number = {LiTH-ISY-R, 2961},
address = {Sweden},
}
This document is an addendum to the main text in A local geometry-based descriptor for 3D data applied to object pose estimation by Fredrik Viksten and Klas Nordberg. This addendum gives proofs for propositions stated in the main document. This addendum also details how to extract information from the fourth order tensor referred to as S22 in the main document.
@techreport{diva2:325000,
author = {Nordberg, Klas and Viksten, Fredrik},
title = {{A local geometry based descriptor for 3D data:
Addendum on rank and segment extraction}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2010},
type = {Other academic},
number = {LiTH-ISY-R, 2951},
address = {Sweden},
}
This report gives an overview and motivates the design of a C++ framework for object recognition using channel-coded feature maps. The code was produced in connection to the work on my PhD thesis Channel-Coded Feature Maps for Object Recognition and Machine Learning. The package contains algorithms ranging from basic image processing routines to specific complex algorithms for creating channel-coded feature maps through piecewise polynomials. Much emphasis has been put in creating a flexible framework using virtual interfaces. This makes it easy e.g. to switch between different image primitives detectors or learning methods in an object recognizer. Some common design choices include an image class with a convenient but fast pixel access, a configurable assert macro for error handling and a common base class for object ownership management. The main computer vision algorithms are channel-coded feature maps (CCFMs) including their derivatives, single-sided colored lines, object detection using an abstract hypothesize-verify framework and tracking and pose estimation using locally weighted regression and CCFMs. The code is considered as having alpha status at best. It is available under the GNU General Public License (GPL) and is mainly intended for future research on the subject.
@techreport{diva2:288558,
author = {Jonsson, Erik},
title = {{Object Recognition using Channel-Coded Feature Maps: C++ Implementation Documentation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2008},
type = {Other academic},
number = {LiTH-ISY-R, 2838},
address = {Sweden},
}
One major goal of the COSPAL project is to develop an artificial cognitive system architecture with the capability of exploratory learning. Exploratory learning is a strategy that allows generalization to be applied at a conceptual level, resulting in an extension of competences. Whereas classical learning methods aim at the best possible generalization, i.e., concluding from a number of samples of a problem class to the problem class itself, exploration aims at applying acquired competences to a new problem class. Incremental or online learning is an inherent requirement for performing exploratory learning.
Exploratory learning requires new theoretic tools and new algorithms. In the COSPAL project, we mainly investigate reinforcement-type learning methods for exploratory learning, and in this paper we focus on its algorithmic aspect. Learning is performed in terms of four nested loops, where the outermost loop reflects the user-reinforcement-feedback loop, the intermediate two loops switch between different solution modes at the symbolic and sub-symbolic levels, respectively, and the innermost loop performs the acquired competences in terms of perception-action cycles. We present a system diagram which explains this process in more detail.
We discuss the learning strategy in terms of learning scenarios provided by the user. This interaction between user ('teacher') and system is a major difference from most existing systems, where the system designer places his world model into the system. We believe that this is the key to extendable, robust system behavior and successful interaction of humans and artificial cognitive systems.
We furthermore address the issue of bootstrapping the system and, in particular, the visual recognition module. We give some more in-depth details about our recognition method and how feedback from higher levels is implemented. The described system is, however, work in progress and no final results are available yet. The preliminary results achieved so far clearly point towards a successful proof of the architecture concept.
@techreport{diva2:302803,
author = {Felsberg, Michael and Wiklund, Johan and Jonsson, Erik and Moe, Anders and Granlund, Gösta},
title = {{Exploratory Learning Structure in Artificial Cognitive Systems}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2006},
type = {Other academic},
number = {LiTH-ISY-R, 2738},
address = {Sweden},
}
To summarize, the VISATEC project was initiated to combine the specific scientific competencies of the research groups at CAU and LiU, together with the industrial view on vision applications, in order to develop novel, more robust algorithms for object localization and recognition. This goal was achieved by a two-fold strategy, whereby on the one hand more robust basic algorithms were developed and on the other hand a method for the combination of these algorithms was devised. In particular, the latter confirmed the consortium's belief that an appropriate combination of a number of basic algorithms will lead to more robust results than any single method could achieve.
However, the multi-cue integration is just one algorithm of many that were developed in the VISATEC project. All developed algorithms are described in some detail in the remainder of this report. An overview of the respective publications can be found in the appendix.
Despite some difficulties that were encountered along the way, we as a consortium feel that the VISATEC project was a success. That this is not only our own opinion is reflected in the outcome of the final review. We believe that the work done during these three years of the project not only furthered our understanding of the matter, but also added to the knowledge within the scientific community and showed new possibilities for industrial vision applications.
@techreport{diva2:288604,
author = {Sommer, Gerald and Granlund, Gösta and Granert, Oliver and Krause, Martin and Nordberg, Klas and Perwass, Christian and Söderberg, Robert and Viksten, Fredrik and Chavarria, Marco},
title = {{Information Society Technologies (IST) programme:
Final Report}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
address = {Sweden},
}
This report introduces a robust contour descriptor for view-based object recognition. In recent years great progress has been made in the field of view-based object recognition, mainly due to the introduction of texture-based features such as SIFT and MSER. Although these are remarkably successful for textured objects, they have problems with man-made objects with little or no texture. For such objects, either explicit geometrical models, or contour and shading based features are also needed. This report introduces a robust contour descriptor which we hope can be combined with texture-based features to obtain object recognition systems that work in a wider range of situations. Each detected contour is described as a sequence of line and ellipse segments, both of which have well-defined geometrical transformations to other views. The feature detector is also quite fast, mainly because chains of contour points are detected first; these chains are then split into line segments, which are later either kept, grouped into ellipses, or discarded. We demonstrate the robustness of the feature detector with a repeatability test under general homography transformations of a planar scene. Through the repeatability test, we find that using ellipse segments instead of lines, where appropriate, improves repeatability. We also apply the features in a robotic setting where object appearances are learned by manipulating the objects.
@techreport{diva2:288582,
author = {Forssen, Per-Erik and Moe, Anders},
title = {{Contour Descriptors for View-Based Object Recognition}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
number = {LiTH-ISY-R, 2706},
address = {Sweden},
}
The MATLAB/C program take version 3.1 is a program for simulation of X-ray projections from 3D volume data. It is based on an older C version by Muller-Merbach as well as an extended C version by Turbell. The program can simulate 2D X-ray projections of 3D objects. These data can then be input to 3D reconstruction algorithms. Here, however, we only demonstrate a couple of 2D reconstruction algorithms, written in MATLAB. Simple MATLAB examples show how to generate the take projections followed by subsequent reconstruction. Compared to the old take version, the C code has been carefully revised. A preliminary, rather untested feature for using a polychromatic X-ray source with different energy levels was already included in the old take version. The polychromatic X-ray feature in the current version, however, has been carefully tested. For example, it has been compared with the results from the program described by Malusek et al. We also demonstrate experiments with a polychromatic X-ray source and a Plexiglass object giving the beam-hardening artefact. Detector sensitivity for different energy levels is not included in take. However, in section~\ref{sec:realexperiment}, we describe a technique to include the detector sensitivity in the energy spectrum. Finally, an experiment comparing real and simulated data was performed. The result was not completely successful, but we still demonstrate it. Contemporary analytical reconstruction methods for helical cone-beam CT have to be designed to handle the Long Object Problem. Normally, a moderate amount of over-scanning is sufficient for reconstruction of a certain Region-of-interest (ROI). Unfortunately, for iterative methods, it seems that the useful ROI will diminish with every iteration step. The remedies proposed here are twofold. Firstly, we use careful extrapolation and masking of projection data. Secondly, we generate and utilize projection data from incompletely reconstructed volume parts, which is rather counter-intuitive and contradictory to our initial assumptions. The results seem very encouraging. Even voxels close to the boundary of the original ROI are enhanced by the iterative loop as well as the middle part.
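To make the beam-hardening effect mentioned above concrete, here is a small numeric sketch with a made-up two-bin spectrum and attenuation values (not the spectra or materials used in take): the effective attenuation per unit length drops as the path length grows, which is what produces the cupping artefact in the reconstruction.

```python
import numpy as np

# Hypothetical two-bin spectrum S(E) and energy-dependent attenuation mu(E)
# for a Plexiglass-like material; the numbers are illustrative only.
S = np.array([0.6, 0.4])       # relative photon counts per energy bin
mu = np.array([0.35, 0.15])    # attenuation coefficients [1/cm] per bin

def transmitted(L):
    """Total transmitted intensity after path length L [cm]."""
    return np.sum(S * np.exp(-mu * L))

I0 = transmitted(0.0)
for L in [1.0, 2.0, 4.0, 8.0]:
    p = -np.log(transmitted(L) / I0)          # measured projection value
    print(f"L = {L:4.1f} cm   effective mu = {p / L:.3f} 1/cm")
# The effective mu decreases with L: low-energy photons are filtered out
# first, which is exactly the beam-hardening effect.
```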
@techreport{diva2:288581,
author = {Seger, Olle and Seger, Maria Magnusson},
title = {{The MATLAB/C program take - a program for simulation of X-ray projections from 3D volume data. Demonstration of beam-hardening artefacts in subsequent CT reconstruction.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
number = {LiTH-ISY-R, 2682},
address = {Sweden},
}
@techreport{diva2:262476,
author = {Jonsson, Erik and Felsberg, Michael and Granlund, Gösta},
title = {{Incremental Associative Learning}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
number = {LiTH-ISY-R, 2691},
address = {Sweden},
}
@techreport{diva2:257175,
author = {Forss\'{e}n, Per-Erik and Johansson, Björn and Granlund, Gösta},
title = {{Learning under Perceptual Aliasing}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
address = {Sweden},
}
This report describes a method to detect and recognize objects from 3D laser radar data. The method is based on local descriptors computed from triplets of planes that are estimated from the data set. Each descriptor that is computed on query data is compared with descriptors computed on object model data to get a hypothesis of object class and pose. A hypothesis is either verified or rejected using a similarity measure between the model data set and the query data set.
@techreport{diva2:257173,
author = {Johansson, Björn and Moe, Anders},
title = {{Object Recognition in 3D Laser Radar Data using Plane triplets}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2005},
type = {Other academic},
number = {LiTH-ISY-R, 2708},
address = {Sweden},
}
In this paper we propose a new operator which combines advantages of monogenic scale-space and Gaussian scale-space, of the monogenic signal and the structure tensor. The gradient energy tensor (GET) defined in this paper is based on Gaussian derivatives up to third order using different scales. These filters are commonly available, separable, and have an optimal uncertainty. The response of this new operator can be used like the monogenic signal to estimate the local amplitude, the local phase, and the local orientation of an image, but it also allows the coherence of image regions to be measured, as in the case of the structure tensor.
@techreport{diva2:288639,
author = {Felsberg, Michael},
title = {{The GET Operator}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2633},
address = {Sweden},
}
This report evaluates the stability of two image interest point detectors, star-pattern points and points based on the fourth order tensor. The Harris operator is also included for comparison. Different image transformations are applied and the repeatability of points between a reference image and each of the transformed images is computed. The transforms are plane rotation, change in scale, change in view, and change in lighting conditions. We conclude that the result largely depends on the image content. The star-pattern points and the fourth order tensor model the image as locally straight lines, while the Harris operator is based on simple/non-simple signals. The two methods evaluated here perform equally well or better than the Harris operator if the model is valid, and perform worse otherwise.
@techreport{diva2:288612,
author = {Johansson, Björn and Söderberg, Robert},
title = {{A Repeatability Test for Two Orientation Based Interest Point Detectors}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2606},
address = {Sweden},
}
In this paper we present a new and efficient method to implement robust smoothing of low-level signal features: B-spline channel smoothing. This method consists of three steps: encoding of the signal features into channels, averaging of the channels, and decoding of the channels. We show that linear smoothing of channels is equivalent to robust smoothing of the signal features, where we make use of quadratic B-splines to generate the channels. The linear decoding from B-spline channels allows us to derive a robust error norm which is very similar to Tukey's biweight error norm. Channel smoothing is superior to iterative robust smoothing implementations like non-linear diffusion, bilateral filtering, and mean-shift approaches for four reasons: it has lower computational complexity, it is easy to implement, it chooses the global minimum error instead of the nearest local minimum, and it can also be used on non-linear spaces, such as orientation space. In the experimental part of the paper we compare channel smoothing and the previously mentioned three other approaches for 2D orientation data.
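A minimal numpy sketch of the three steps (encoding, channel averaging, decoding) with quadratic B-spline channels; the integer channel placement and the simple strongest-group decoding below are illustrative assumptions rather than the exact implementation described in the report.

```python
import numpy as np

def b2(t):
    """Quadratic B-spline kernel with support [-1.5, 1.5]."""
    t = np.abs(t)
    return np.where(t <= 0.5, 0.75 - t**2,
           np.where(t <= 1.5, 0.5 * (1.5 - t)**2, 0.0))

def encode(x, centers):
    """Channel-encode each sample in x -> array of shape (len(x), len(centers))."""
    return b2(x[:, None] - centers[None, :])

def decode(c, centers):
    """Linear decoding: pick the strongest group of three consecutive
    channels and return its weighted mean (the nearest robust mode)."""
    group = c[:-2] + c[1:-1] + c[2:]
    k = np.argmax(group) + 1
    w = c[k - 1:k + 2]
    return np.dot(centers[k - 1:k + 2], w) / np.sum(w)

# Samples around 3.1 plus one outlier at 9.0; the arithmetic mean is biased,
# while channel averaging followed by decoding ignores the outlier.
x = np.array([2.9, 3.0, 3.1, 3.2, 3.3, 9.0])
centers = np.arange(0, 12, dtype=float)
c_mean = encode(x, centers).mean(axis=0)
print("arithmetic mean:  ", x.mean())
print("channel-smoothed: ", decode(c_mean, centers))
```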
@techreport{diva2:288553,
author = {Felsberg, Michael and Forssen, Per-Erik and Scharr, Hanno},
title = {{Efficient Robust Smoothing of Low-Level Signal Features}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2619},
address = {Sweden},
}
Most contemporary CT systems employ non-exact methods. This treatise reports on how these methods could be transformed from non-exact to exact reconstruction methods by means of iterative post-processing. Compared to traditional algebraic reconstruction (ART) we expect much faster convergence (in theory quadratic), due to a much improved first guess and the fact that each iteration includes the same non-exact analytical reconstruction step as the first guess.
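A rough sketch of the update form described above, using 2D parallel-beam operators from scikit-image as stand-ins for the forward projection and the analytical reconstruction step; the phantom, angle sampling, and number of iterations are assumptions for illustration, not the setup from the report.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# 2D parallel-beam stand-ins for the forward projector P and the analytical
# reconstruction R; the point is the update form f <- f + R(p - P f).
image = rescale(shepp_logan_phantom(), 0.25)
theta = np.linspace(0.0, 180.0, 60, endpoint=False)
p = radon(image, theta=theta)

f = iradon(p, theta=theta)                    # first guess (filtered backprojection)
for _ in range(3):
    residual = p - radon(f, theta=theta)      # reproject the current estimate
    f = f + iradon(residual, theta=theta)     # correct with the same analytical step
    print("RMSE:", np.sqrt(np.mean((f - image) ** 2)))
```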
@techreport{diva2:288551,
author = {Danielsson, Per-Erik and Seger, Maria Magnusson},
title = {{Combining Fourier and iterative methods in computer tomography:
Analysis of an iteration scheme. The 2D-case}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2634},
address = {Sweden},
}
In this paper we present a new method to implement a robust estimator: B-spline channel smoothing. We show that linear smoothing of channels is equivalent to a robust estimator, where we make use of the channel representation based upon quadratic B-splines. The linear decoding from B-spline channels allows us to derive a robust error norm which is very similar to Tukey's biweight error norm. Using channel smoothing instead of iterative robust estimator implementations like non-linear diffusion, bilateral filtering, and mean-shift approaches is advantageous since channel smoothing is faster, it is easy to implement, it chooses the global minimum error instead of the nearest local minimum, and it can also be used on non-linear spaces, such as orientation space. As an application, we implemented orientation smoothing and compared it to the other three approaches.
@techreport{diva2:288549,
author = {Felsberg, Michael and Forssen, Per-Erik and Scharr, Hanno},
title = {{B-Spline Channel Smoothing for Robust Estimation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2579},
address = {Sweden},
}
This report describes a fourth order tensor defined on projective spaces which can be used for the representation of medium-level features, e.g., one or more oriented segments. The tensor has one part which describes what type of local structures are present in a region, and one part which describes where they are located. This information can be used, e.g., to represent multiple orientations, corners, and line-endings. The tensor can be defined for arbitrary signal dimension, but the presentation focuses on the properties of the fourth order tensor for the case of 2D and 3D image data. A method for estimating the proposed tensor representation by means of simple computations directly from the structure tensor is presented. Given a simple matrix representation of the tensor, it can be shown that there is a direct correspondence between the number of oriented segments and the rank of the matrix, provided that the number of segments is three or less. The report also presents techniques for extracting information about the oriented segments which the tensor represents. Finally, it is shown that a small set of coefficients can be computed from the proposed tensor which are invariant to changes of the coordinate system.
@techreport{diva2:288343,
author = {Nordberg, Klas},
title = {{A fourth order tensor for representation of orientation and position of oriented segments}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2587},
address = {Sweden},
}
This report describes how blob features can be used for automatic estimation of the fundamental matrix from two perspective projections of a 3D scene. Blobs are perceptually salient, homogeneous, compact image regions. They are represented by their average colour, area, centre of gravity and inertia matrix. Coarse blob correspondences are found by voting using colour and local similarity transform matching on blob pairs. We then do RANSAC sampling of the coarse correspondences, and weight each estimate according to how well the approximating conics and colours of two blobs correspond. The initial voting significantly reduces the number of RANSAC samples required, and the extra information besides position allows us to reject false matches more accurately than RANSAC using point features.
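The blob-specific voting and conic weighting are not reproduced here, but the following numpy sketch shows the generic estimation loop they feed into: RANSAC over point correspondences with a normalized eight-point fundamental-matrix fit and a Sampson-distance inlier test. All details below are standard textbook choices, not taken from the report.

```python
import numpy as np

def normalize(pts):
    """Hartley normalisation: translate to the centroid, scale to mean distance sqrt(2)."""
    c = pts.mean(axis=0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    ph = np.column_stack([pts, np.ones(len(pts))])
    return (T @ ph.T).T, T

def eight_point(x1, x2):
    """Fundamental matrix from >= 8 correspondences (n x 2 arrays)."""
    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    A = np.column_stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                         p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                         p1[:, 0], p1[:, 1], np.ones(len(p1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt      # enforce rank 2
    return T2.T @ F @ T1                          # undo the normalisation

def sampson_error(F, x1, x2):
    p1 = np.column_stack([x1, np.ones(len(x1))])
    p2 = np.column_stack([x2, np.ones(len(x2))])
    Fx1, Ftx2 = F @ p1.T, F.T @ p2.T
    num = np.sum(p2 * (F @ p1.T).T, axis=1) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den

def ransac_fundamental(x1, x2, iters=500, thresh=1.0, seed=0):
    rng = np.random.default_rng(seed)
    best_F, best_inliers = None, np.zeros(len(x1), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(x1), 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        inliers = sampson_error(F, x1, x2) < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    return eight_point(x1[best_inliers], x2[best_inliers]), best_inliers
```

Given matched blob centroids as x1 and x2 (n x 2 arrays), ransac_fundamental(x1, x2) returns an estimate of F and the inlier mask.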
@techreport{diva2:288340,
author = {Forssen, Per-Erik and Moe, Anders},
title = {{Automatic Estimation of Epipolar Geometry from Blob Features}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2620},
address = {Sweden},
}
This report brings together a novel approach to some computer vision problems and a particular algorithmic development of the Landweber iterative algorithm. The algorithm solves a class of high-dimensional, sparse, and constrained least-squares problems, which arise in various computer vision learning tasks, such as object recognition and object pose estimation. The algorithm has recently been applied to these problems, but it has been used rather heuristically. In this report we describe the method and put it on firm mathematical ground. We consider a convexly constrained weighted least-squares problem and propose for its solution a projected Landweber method which employs oblique projections onto the closed convex constraint set. We formulate the problem, present the algorithm and work out its convergence properties, including a rate-of-convergence result. The results are put in perspective of currently available projected Landweber methods. The application to supervised learning is described, and the method is evaluated in a function approximation experiment.
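A minimal sketch of a projected Landweber iteration for a nonnegatively constrained, weighted least-squares problem. Note that it uses a plain orthogonal projection onto the constraint set, whereas the report analyses oblique projections, so this only illustrates the overall form of the iteration.

```python
import numpy as np

def projected_landweber(A, b, W=None, step=None, iters=200):
    """Approximately minimise ||A x - b||_W^2 subject to x >= 0 by a Landweber
    step followed by (orthogonal) projection onto the constraint set."""
    m, n = A.shape
    W = np.eye(m) if W is None else W
    if step is None:
        # any step below 2 / ||A^T W A|| makes the unconstrained iteration converge
        step = 1.0 / np.linalg.norm(A.T @ W @ A, 2)
    x = np.zeros(n)
    for _ in range(iters):
        x = x + step * A.T @ W @ (b - A @ x)   # Landweber step
        x = np.maximum(x, 0.0)                 # projection onto x >= 0
    return x

# Tiny example: recover a sparse nonnegative solution of an overdetermined system.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 10))
x_true = np.maximum(rng.standard_normal(10), 0.0)
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(projected_landweber(A, b), 2))
print(np.round(x_true, 2))
```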
@techreport{diva2:244368,
author = {Johansson, Björn and Elfving, Tommy and Kozlov, Vladimir and Censor, Yair and Granlund, Gösta},
title = {{The Application of an Oblique-Projected Landweber Method to a Model of Supervised Learning}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2004},
type = {Other academic},
number = {LiTH-ISY-R, 2623},
address = {Sweden},
}
Image intensity gradients can be encoded in a 2-dimensional channel representation. This report discusses the computation of such gradient channel matrices and what information can be extracted from them. In particular, this representation makes it possible to distinguish multiple orientations and magnitudes in a single representation. It is shown that this can be used to recover orientation very accurately. This holds in particular near orientation discontinuities, where classical orientation estimation fails.
@techreport{diva2:288613,
author = {Spies, Hagen},
title = {{Gradient Channel Matrices for Orientation Estimation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2003},
type = {Other academic},
number = {LiTH-ISY-R, 2540},
address = {Sweden},
}
In this paper we address the problem of appropriately representing the intrinsic dimensionality of image neighborhoods. This dimensionality describes the degrees of freedom of a local image patch and it gives rise to some of the most often applied corner and edge detectors. It is common to categorize the intrinsic dimensionality (iD) into three distinct cases: i0D, i1D, and i2D. Real images, however, contain combinations of all three dimensionalities, which has to be taken into account by a continuous representation. Based on considerations of the structure tensor, we derive a cone-shaped iD-space which leads to a probabilistic point of view on the estimation of intrinsic dimensionality.
@techreport{diva2:288326,
author = {Felsberg, Michael and Kruger, Norbert},
title = {{A Probabilistic Definition of Intrinsic Dimensionality for Images}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2003},
type = {Other academic},
number = {LiTH-ISY-R, 2520},
address = {Sweden},
}
The use of linear filters, i.e. convolutions, inevitably introduces dependencies in the uncertainties of the filter outputs. Such non-vanishing covariances appear both between different positions and between the responses from different filters (even at the same position). This report describes how these covariances between the output of linear filters can be computed. We then examine the induced covariance matrices for some typical 1D and 2D filters. Finally the total noise reduction properties are examined.
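For the 1D case with i.i.d. input noise of variance sigma^2, the covariance between the outputs of two filters g and h at relative displacement d reduces to the correlation of the kernels, Cov(y_g[n], y_h[n+d]) = sigma^2 * sum_k g[k] h[k+d]. The following sketch (with arbitrarily chosen kernels, not the filters examined in the report) checks this empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x = sigma * rng.standard_normal(200_000)        # white noise input

g = np.array([1.0, 2.0, 1.0]) / 4.0             # small smoothing kernel
h = np.array([-1.0, 0.0, 1.0]) / 2.0            # central-difference kernel

yg = np.convolve(x, g, mode='valid')
yh = np.convolve(x, h, mode='valid')

d = 1  # relative displacement between the two output samples
theory = sigma**2 * sum(g[k] * h[k + d] for k in range(len(g)) if 0 <= k + d < len(h))
empirical = np.cov(yg[:-d], yh[d:])[0, 1]
print("theory:", theory, "  empirical:", empirical)
```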
@techreport{diva2:288311,
author = {Spies, Hagen},
title = {{Covariances of Linear Filter Outputs in Computer Vision}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2003},
type = {Other academic},
number = {LiTH-ISY-R, 2504},
address = {Sweden},
}
This report describes a view-based method for object recognition and estimation of object pose in still images. The method is based on feature vector matching and clustering. A set of interest points, in this case star-patterns, is detected and combined into pairs. A pair of patches, centered around each point in the pair, is extracted from a local orientation image. The patch orientation and size depend on the relative positions of the points, which makes the representation invariant to translation, rotation, and scale. Each pair of patches constitutes a feature vector. The method is demonstrated on a number of real images.
@techreport{diva2:257174,
author = {Johansson, Björn and Moe, Anders},
title = {{Patch-Duplets for Object Recognition and Pose Estimation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2003},
type = {Other academic},
number = {LiTH-ISY-R, 2553},
address = {Sweden},
}
The structure tensor has been used mainly for representation of local orientation in spaces of arbitrary dimensions, where the eigenvectors represent the orientation and the corresponding eigenvalues indicate the type of structure which is represented. Apart from being local, the structure tensor may be referred to as "object centered" since it describes the corresponding structure relative to a local reference system. This paper proposes that the basic properties of the structure tensor can be extended to a tensor defined in a projective space rather than in a local Euclidean space. The result, the "projective tensor", is symmetric in the same way as the structure tensor, and also uses the eigensystem to carry the relevant information. However, instead of orientation, the projective tensor represents geometrical primitives such as points, lines, and planes (depending on dimensionality of the underlying space). Furthermore, this representation has the useful property of mapping the operation of forming the affine hull of points and lines to tensor summation, e.g., the sum of two projective tensors which represent two points amounts to a projective tensor that represents the line which passes through the two points, etc. The projective tensor may be referred to as "view centered" since each tensor, which still may be defined on a local scale, represents a geometric primitive relative to a global image based reference system. This implies that two such tensors may be combined, e.g., using summation, in a meaningful way over large regions.
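A small numerical illustration of the summation property, under the assumption that a point is represented by the outer product of its normalised homogeneous coordinates (a plausible but assumed normalisation, not necessarily the one used in the report): the sum of two point tensors has rank two and is annihilated by the dual line vector through the two points.

```python
import numpy as np

def point_tensor(p):
    """Assumed point representation: outer product of normalised
    homogeneous coordinates (for illustration only)."""
    p = np.asarray(p, dtype=float)
    p = p / np.linalg.norm(p)
    return np.outer(p, p)

p1 = np.array([1.0, 2.0, 1.0])   # homogeneous 2D points
p2 = np.array([3.0, 0.5, 1.0])

T = point_tensor(p1) + point_tensor(p2)
l = np.cross(p1, p2)             # dual homogeneous vector of the connecting line

# T has rank 2 and its null space is spanned by l, so the tensor sum indeed
# encodes the line through the two points (here read off via its dual).
print("rank:", np.linalg.matrix_rank(T))
print("T @ l ~ 0:", np.allclose(T @ l, 0.0))
```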
@techreport{diva2:288635,
author = {Nordberg, Klas},
title = {{The structure tensor in projective spaces}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2424},
address = {Sweden},
}
In this paper we consider the channel representation based upon quadratic B-splines from a statistical point of view. Interpreting the channel representation as a kernel method for estimating probability density functions, we establish a channel algebra which allows basic algebraic operations on measurements to be performed directly in the channel representation. Furthermore, as a central point, we identify the smoothing of channel values with a robust estimator, or equivalently, a diffusion process.
@techreport{diva2:288621,
author = {Felsberg, Michael and Scharr, Hanno and Forssen, Per-Erik},
title = {{The B-Spline Channel Representation: Channel Algebra and Channel Based Diffusion Filtering}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2461},
address = {Sweden},
}
Next generation helical cone-beam CT will feature pitches around 80 mm. It is predicted that reconstruction algorithms to be used in these machines with still rather modest cone angles may not necessarily be exact, but rather have an emphasis on simplicity and speed. The PI-methods are a family of non-exact algorithms, all of which are based on complete data capture with a detector collimated to the Tam-window followed by rebinning to obliquely parallel ray geometry. The non-exactness is identified as inconsistency in the space invariant one-dimensional ramp-filtering step. It is shown that this inconsistency can be reduced, resulting in a significant improvement in image quality and an increased tolerance for higher pitch and cone angle. A short theoretical background for the PI-methods is given but the algorithms themselves are not given in any detail. A set of experiments on mathematical phantoms illustrates (among other things) how the amount of artefacts grows with increasing cone angles.
@techreport{diva2:288610,
author = {Danielsson, Per-Erik and Seger, Maria Magnusson and Turbell, Henrik},
title = {{The PI-methods for Helical Cone-Beam Tomography}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2428},
address = {Sweden},
}
In this report we describe how an RGB component colour image may be expanded into a set of channel images, and how the original colour image may be reconstructed from these. We also demonstrate the effect of averaging on the channel images and how it differs from conventional averaging. Finally we demonstrate how boundaries can be detected as a change in the confidence of colour state.
@techreport{diva2:288277,
author = {Forssen, Per-Erik and Granlund, Gösta and Wiklund, Johan},
title = {{Channel Representation of Colour Images}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2418},
address = {Sweden},
}
In this paper we address the topics of scale-space and phase-based signal processing in a common framework. The involved linear scale-space is no longer based on the Gaussian kernel but on the Poisson kernel. The resulting scale-space representation is directly related to the monogenic signal, a 2D generalization of the analytic signal. Hence, the local phase arises as a natural concept in this framework which results in several advanced relationships that can be used in image processing.
@techreport{diva2:288275,
author = {Felsberg, Michael and Sommer, Gerald},
title = {{The Poisson Scale-Space: A Unified Approach to Phase-Based Image Processing in Scale-Space}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2453},
address = {Sweden},
}
This report describes how the choice of kernel affects a non-parametric density estimation. Methods for accurate localisation of peaks in the estimated densities are developed for Gaussian and cos^2 kernels. The accuracy and robustness of the peak localisation methods are studied with respect to noise, number of samples, and interference between peaks. Although the peak localisation is formulated in the framework of non-parametric density estimation, the results are also applicable to associative learning with localised responses.
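A sketch of the Gaussian-kernel case: a grid-based density estimate whose peak is refined by fitting a parabola through the maximum and its two neighbours. The cos^2 kernel and the report's exact localisation formulas are not reproduced; the refinement below is a generic sub-grid scheme used here for illustration.

```python
import numpy as np

def kde(samples, grid, h):
    """Gaussian-kernel density estimate evaluated on a grid."""
    d = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * d**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def refine_peak(grid, density):
    """Sub-grid peak localisation via a parabola through the maximum and its neighbours."""
    i = np.argmax(density[1:-1]) + 1
    y0, y1, y2 = density[i - 1], density[i], density[i + 1]
    offset = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)
    return grid[i] + offset * (grid[1] - grid[0])

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(2.0, 0.3, 200), rng.normal(5.0, 0.3, 50)])
grid = np.linspace(0, 7, 141)
density = kde(samples, grid, h=0.2)
print("strongest peak (near 2.0):", refine_peak(grid, density))
```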
@techreport{diva2:288272,
author = {Forssen, Per-Erik},
title = {{Observations Concerning Reconstructions with Local Support}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2425},
address = {Sweden},
}
The channel representation is a simple yet powerful representation of scalars and vectors. It is especially suited for representation of several scalars at the same time without mixing them up.
This report is partly intended to serve as a simple illustration of the channel representation. The report shows how the channels can be used to represent multiple orientations in two dimensions. The idea is to make a channel representation of the local orientation angle computed from the image gradient. The representation basically becomes an orientation histogram with overlapping bins.
The channel histogram is compared with the orientation tensor, which is another representation of orientation. The performance is comparable to that of tensors in the simple signal case, but decreases slightly with an increasing number of channels. The channel histogram outperforms the tensors on non-simple signals.
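A minimal sketch of such an overlapping-bin orientation histogram: the double angle of the gradient orientation is channel-encoded with cos^2 kernels and weighted by the gradient magnitude. The number of channels, the kernel width, and the magnitude weighting are assumptions for illustration, not the exact choices of the report.

```python
import numpy as np

def orientation_channel_histogram(gx, gy, n_channels=8):
    """Overlapping-bin histogram of local orientation in double-angle
    representation, using cos^2 channel kernels (three channels overlap);
    gradient magnitude is used as the vote weight."""
    phi = np.mod(2.0 * np.arctan2(gy, gx), 2.0 * np.pi).ravel()  # double angle
    w = np.hypot(gx, gy).ravel()
    centers = np.arange(n_channels) * 2.0 * np.pi / n_channels
    hist = np.zeros(n_channels)
    for k, c in enumerate(centers):
        d = (phi - c + np.pi) % (2.0 * np.pi) - np.pi            # circular distance
        d *= n_channels / (2.0 * np.pi)                          # in channel units
        hist[k] = np.sum(w * np.where(np.abs(d) < 1.5,
                                      np.cos(np.pi * d / 3.0) ** 2, 0.0))
    return hist

# Two superimposed gradient orientations (0 and 90 degrees) produce two
# separated peaks in the overlapping-bin histogram.
gx = np.array([[1.0, 0.0, 1.0, 0.0]])
gy = np.array([[0.0, 1.0, 0.0, 1.0]])
print(np.round(orientation_channel_histogram(gx, gy), 2))
```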
@techreport{diva2:257179,
author = {Johansson, Björn},
title = {{Representing Multiple Orientations in 2D with Orientation Channel Histograms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2475},
address = {Sweden},
}
A new architecture for learning systems has been developed. A number of particular design features in combination result in a very high performance and excellent robustness. The architecture uses a monopolar channel information representation. The channel representation implies a partially overlapping mapping of signals into a higher-dimensional space, such that a flexible but continuous restructuring mapping can be made. The high-dimensional mapping introduces locality in the information representation, which is directly available in wavelets or filter outputs. Single level maps using this representation can produce closed decision regions, thereby eliminating the need for costly back-propagation. The monopolar property implies that data only utilizes one polarity, say positive values, in addition to zero, allowing zero to represent no information. This leads to an efficient sparse representation.
The processing mode of the architecture is association where the mapping of feature inputs onto desired state outputs is learned from a representative training set. The sparse monopolar representation together with locality, using individual learning rates, allows a fast optimization, as the system exhibits linear complexity. Mapping into multiple channels gives a strategy to use confidence statements in data, leading to a low sensitivity to noise in features. The result is an architecture allowing systems with a complexity of some hundred thousand features described by some hundred thousand samples to be trained in typically less than an hour. Experiments that demonstrate functionality and noise immunity are presented. The architecture has been applied to the design of hyper complex operations for view centered object recognition in robot vision.
@techreport{diva2:257178,
author = {Granlund, Gösta and Forss\'{e}n, Per-Erik and Johansson, Björn},
title = {{HiperLearn:
A High Performance Learning Architecture}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2002},
type = {Other academic},
number = {LiTH-ISY-R, 2409},
address = {Sweden},
}
This report defines the rank complement of a diagonalizable matrix (i.e. a matrix which can be brought to a diagonal form by means of a change of basis) as the interchange of the range and the null space. Given a diagonalizable matrix A there is in general no unique matrix Ac which has a range equal to the null space of A and a null space equal to the range of A; only matrices of full rank have a unique rank complement: the zero matrix. Consequently, the rank complement operation is not a distinct operation, but rather a characterization of any operation which makes an interchange of the range and the null space. One particular rank complement operation is introduced here, which eventually leads to an implementation of rank complement operations in terms of polynomials in A. The main result is that for each possible rank r of A there is a polynomial in A which evaluates to a matrix Ac which is a rank complement of A. The report provides explicit expressions for matrix polynomials which compute a rank complement of a symmetric matrix. These results are then generalized to the case of diagonalizable matrices. Finally, a Matlab function is described that implements a rank complement operation based on the results derived.
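For reference, a rank complement of a symmetric matrix can also be built directly from an eigendecomposition, as in the sketch below; this is not the polynomial construction of the report, only a way of making the definition concrete.

```python
import numpy as np

def rank_complement_sym(A, tol=1e-10):
    """One rank complement of a symmetric matrix A: a matrix whose range is
    the null space of A and whose null space is the range of A. Here simply
    the orthogonal projector onto null(A); the report instead constructs
    such a matrix as a polynomial in A."""
    w, V = np.linalg.eigh(A)
    Vn = V[:, np.abs(w) < tol]      # eigenvectors spanning the null space
    return Vn @ Vn.T

# Rank-2 symmetric 3x3 example.
B = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
A = B @ B.T
Ac = rank_complement_sym(A)
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(Ac))   # 2 and 1
print(np.allclose(Ac @ A, 0.0), np.allclose(A @ Ac, 0.0))    # ranges and null spaces swapped
```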
@techreport{diva2:288596,
author = {Nordberg, Klas and Farnebäck, Gunnar},
title = {{Rank complement of diagonalizable matrices using polynomial functions}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2001},
type = {Other academic},
number = {LiTH-ISY-R, 2369},
address = {Sweden},
}
This report describes a novel window matching technique. We perform window matching by transforming image data into sparse features, and apply a computationally efficient matching technique in the sparse feature space. The gain in execution time for the matching is roughly 10 times compared to full window matching techniques such as SSD, but the total execution time for the matching also involves an edge filtering step. Since the edge responses may be used for matching of several regions, the proposed matching technique is increasingly advantageous when the number of regions to keep track of increases, and when the size of the search window increases. The technique is used in a real-time ego-motion estimation system in the WITAS project. Ego-motion is estimated by tracking of a set of structure points, i.e. regions that do not have the aperture problem. Comparisons with SSD with regard to speed and accuracy are made.
@techreport{diva2:288544,
author = {Forssen, Per-Erik},
title = {{Window Matching using Sparse Templates}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2001},
type = {Other academic},
number = {LiTH-ISY-R, 2392},
address = {Sweden},
}
This report starts with an introduction to the concepts active perception, reactive systems, and state dependency, and to fundamental aspects of perception such as the perceptual aliasing problem, and the number-of-percepts vs. number-of-states trade-off. We then introduce finite state machines, and extend them to accommodate active perception. Finally we demonstrate a state-transition mechanism that is applicable to autonomous navigation.
@techreport{diva2:288318,
author = {Forssen, Per-Erik},
title = {{Autonomous Navigation using Active Perception}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2001},
type = {Other academic},
number = {LiTH-ISY-R, 2395},
address = {Sweden},
}
This report describes an idea based on the work in [1], where an algorithm for learning automatic representation of visual operators is presented. The algorithm in [1] uses canonical correlation to find a suitable subspace in which the signal is invariant to some desired properties. This report presents a related approach specially designed for classification problems. The goal is to find a subspace in which the signal is invariant within each class, and at the same time compute the class representation in that subspace. This algorithm is closely related to the one in [1], but less computationally demanding, and it is shown that the two algorithms are equivalent if we have an equal number of training samples for each class. Even though the new algorithm is designed for pure classification problems it can still be used to learn visual operators as will be shown in the experiment section. [1] M. Borga. Learning Multidimensional Signal Processing. PhD thesis, Linköping University, Sweden, SE-581 83 Linköping, 1998. Dissertation No 531, ISBN 91-7219-202-X.
@techreport{diva2:288281,
author = {Johansson, Björn},
title = {{On Classification: Simultaneously Reducing Dimensionality and Finding Automatic Representation using Canonical Correlation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2001},
type = {Other academic},
number = {LiTH-ISY-R, 2375},
address = {Sweden},
}
This report is a complement to the working document [1], where a sparse associative network is described. This report shows that the net learning rule in [1] can be viewed as the solution to a weighted least squares problem. This means that we can apply the theory framework of least squares problems, and compare the net rule with some other iterative algorithms that solve the same problem. The learning rule is compared with the gradient search algorithm and the RPROP algorithm in a simple synthetic experiment. The gradient rule has the slowest convergence while the associative and the RPROP rules have similar convergence. The associative learning rule has a smaller initial error than the RPROP rule though.
It is also shown in the same experiment that we get a faster convergence if we have a monopolar constraint on the solution, i.e. if the solution is constrained to be non-negative. The least squares error is a bit higher but the norm of the solution is smaller, which gives a smaller interpolation error.
The report also discusses a generalization of the least squares model, which includes other known function approximation models.
[1] G. Granlund. Parallel Learning in Artificial Vision Systems: Working Document. Dept. EE, Linköping University, 2000.
@techreport{diva2:257177,
author = {Johansson, Björn},
title = {{On Sparse Associative Networks:
A Least Squares Formulation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2001},
type = {Other academic},
number = {LiTH-ISY-R, 2368},
address = {Sweden},
}
@techreport{diva2:288619,
author = {Granlund, Gösta H.},
title = {{The Use of Dynamics to Establish Knowledge of Invariant Structure}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2240},
address = {Sweden},
}
This report describes an experimental still image coder that grew out of a project in the graduate course ``Advanced Video Coding'' in spring 2000. The project investigated the idea of using local orientation histograms in fractal coding. Instead of performing a correlation-like grey-level matching of image regions, the block search is made by matching feature histograms of the block contents. The feature investigated in this report is local orientation, but in principle other features could be used as well. In its current state the coder does not outperform state-of-the-art still image coders, but the block-search strategy seems promising, and will probably prove useful in several other applications.
@techreport{diva2:288616,
author = {Forssen, Per-Erik and Johansson, Björn},
title = {{Fractal Coding by Means of Local Feature Histograms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2295},
address = {Sweden},
}
This report describes the principles of an algorithm developed within the WITAS project. The goal of the WITAS project is to build an autonomous helicopter that can navigate autonomously, using differential GPS, GIS-data of the underlying terrain (elevation models and digital orthophotographs) and a video camera. Using differential GPS and other non-visual sensory equipment, the system is able to obtain crude estimates of its position and heading direction. These estimates can be refined by matching camera images against the on-board GIS-data. This refinement process, however, is rather time consuming, and will thus only be made every once in a while. For real-time refinement of camera position and heading, the system will iteratively update the estimates using frame to frame correspondence only. In each frame a sparse set of image displacement estimates is calculated, and from these the perspective in the current image can be found. Using the calculated perspective and knowledge of the camera parameters, new values of camera position and heading can be obtained. The resultant camera position and heading can exhibit a slow drift if the original alignment was not perfect, and thus a corrective alignment with GIS-data should be performed once every minute or so.
@techreport{diva2:288566,
author = {Forssen, Per-Erik},
title = {{Updating Camera Location and Heading using a Sparse Displacement Field}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2318},
address = {Sweden},
}
@techreport{diva2:288548,
author = {Granlund, Gösta H.},
title = {{Channel Representation of Information}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2236},
address = {Sweden},
}
This report describes a technique to detect curvature. The technique uses local polynomial fitting on a local orientation description of an image. The idea is based on the theory of rotational symmetries which describes curvature, circles, star-patterns etc. The local polynomial fitting is shown to be equivalent to calculating partial derivatives on a lowpass version of the local orientation. The new method can therefore be very efficiently implemented both in the single-scale case and in the multi-scale case.
@techreport{diva2:288546,
author = {Johansson, Björn},
title = {{Curvature Detection using Polynomial Fitting on Local Orientation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2312},
address = {Sweden},
}
@techreport{diva2:288331,
author = {Granlund, Gösta H.},
title = {{Context Controllable Linkage Models}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2238},
address = {Sweden},
}
One important problem in image analysis is the localization of a template in a larger image. Applications where the solution of this problem can be used include: tracking, optical flow, and stereo vision. The matching method studied here solves this problem by defining a new similarity measurement between a template and an image neighborhood. This similarity is computed for all possible integer positions of the template within the image. The position for which we get the highest similarity is considered to be the match. The similarity is not necessarily computed using the original pixel values directly, but can of course be derived from higher level image features.
The similarity measurement can be computed in different ways, and the simplest approaches are correlation-type algorithms. Aschwanden and Guggenbühl [2] have done a comparison between such algorithms. One of the best and simplest algorithms they tested is normalized cross-correlation (NCC). Therefore this algorithm has been used for comparison with the PAIRS algorithm that is developed by the author and described in this text. PAIRS uses a completely different similarity measurement based on sets of bits extracted from the template and the image.
This work is done within WITAS, which is a project dealing with UAVs (unmanned aerial vehicles). Two specific applications of the developed template matching algorithm have been studied.
- One application is tracking of cars in video sequences from a helicopter.
- The other one is computing optical flow in such video sequences in order to detect moving objects, especially vehicles on roads.
The video from the helicopter is in color (RGB) and this fact is used in the presented tracking algorithm. The PAIRS algorithm has been applied to these two applications and the results are reported.
A part of this text concerns a general approach to template matching called Maximum Entropy Matching (MEM) that is developed here. The main idea of MEM is that the more data we compare on a computer, the longer it takes, and therefore the data that we compare should have maximum average information, that is, maximum entropy. We will see that this approach can be used to create template matching algorithms which are on the order of 10 times faster than correlation (NCC) without decreasing the performance.
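For context, here is a plain implementation of the NCC baseline referred to above (the MEM/PAIRS bit-set matching itself is not reproduced); this exhaustive numpy version is only meant to show what the faster methods are compared against.

```python
import numpy as np

def ncc_match(image, template):
    """Exhaustive normalised cross-correlation: returns the top-left
    position and score of the best integer match of template in image."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t**2).sum())
    best, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r + th, c:c + tw]
            w = w - w.mean()
            denom = np.sqrt((w**2).sum()) * t_norm
            if denom > 0:
                score = (w * t).sum() / denom
                if score > best:
                    best, best_pos = score, (r, c)
    return best_pos, best

rng = np.random.default_rng(0)
image = rng.random((60, 80))
template = image[20:30, 40:52].copy()
print(ncc_match(image, template))   # ((20, 40), score close to 1.0)
```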
@techreport{diva2:288327,
author = {Lundberg, Frans},
title = {{Maximum Entropy Matching: An Approach to Fast Template Matching}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2313},
address = {Sweden},
}
@techreport{diva2:288317,
author = {Granlund, Gösta},
title = {{The Dichotomy of Strategies for Spatial-Cognitive Information Processing}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2241},
address = {Sweden},
}
Some image patterns, e.g. circles, hyperbolic curves, star patterns etc., can be described in a compact way using local orientation. The features mentioned above are part of a family of patterns called rotational symmetries. This theory can be used to detect image patterns from the local orientation in double angle representation of an image. Some of the rotational symmetries were originally described in terms of the local orientation without being designed to detect a certain feature. The question is then: given a description in double angle representation, what kind of image features does this description correspond to? This 'inverse', or backprojection, is not unambiguous - many patterns have the same local orientation description. This report answers this question for the case of rotational symmetries and also for some other descriptions.
@techreport{diva2:288305,
author = {Johansson, Björn},
title = {{Backprojection of Some Image Symmetries Based on a Local Orientation Description}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2311},
address = {Sweden},
}
@techreport{diva2:288280,
author = {Granlund, Gösta H.},
title = {{Learning Through Response-Driven Association}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2237},
address = {Sweden},
}
@techreport{diva2:288276,
author = {Granlund, Gösta H.},
title = {{Low Level Image Interpretation Using Associative Mapping}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2239},
address = {Sweden},
}
This survey contains links to and facts about a number of projects on content-based search in image databases around the world today. The main focus is on what kind of image features are used, but also on the user interface and the user's possibilities to interact with the system (i.e. what 'visual language' is used).
@techreport{diva2:257176,
author = {Johansson, Björn},
title = {{A Survey on:
Contents Based Search in Image Databases}},
institution = {Linköping University, Department of Electrical Engineering},
year = {2000},
type = {Other academic},
number = {LiTH-ISY-R, 2215},
address = {Sweden},
}
@techreport{diva2:288602,
author = {Reed, Todd},
title = {{A Baseline System for Image and Map Registration using Sparse Hierarchical Features}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1999},
type = {Other academic},
number = {LiTH-ISY-R, 2138},
address = {Sweden},
}
Computer vision systems used in autonomous mobile vehicles are typically linked to higher-level deliberation processes. One important aspect of this link is how to connect, or anchor, the symbols used at the higher level to the objects in the vision system that these symbols refer to. Anchoring is complicated by the fact that the vision data are inherently affected by uncertainty. We propose an anchoring technique that uses fuzzy sets to represent the uncertainty in the perceptual data. We show examples where this technique allows a deliberative system to reason about the objects (cars) detected by a vision system embarked in an unmanned helicopter, in the framework of the Witas project.
@techreport{diva2:288592,
author = {Andersson, Thord and Coradeschi, Silvia and Saffiotti, Alessandro},
title = {{Fuzzy matching of visual cues in an unmanned airborne vehicle}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1999},
type = {Other academic},
address = {Sweden},
}
@techreport{diva2:288634,
author = {Borga, Magnus and Knutsson, Hans},
title = {{An Adaptive Stereo Algorithm Based on Canonical Correlation Analysis}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1998},
type = {Other academic},
number = {LiTH-ISY-R, 2013},
address = {Sweden},
}
@techreport{diva2:288629,
author = {Granlund, Gösta},
title = {{Does Vision Inevitably Have to be Active?}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1998},
type = {Other academic},
number = {LiTH-ISY-R, 2068},
address = {Sweden},
}
This report introduces a signal processing strategy for depth segmentation and scene reconstruction that incorporates occlusion as a natural component. The work aims to make maximal use of connectivity in the temporal domain under the condition that the scene is static and that the camera motion is known. An object behind the foreground is reconstructed using the fact that different parts of the object have been seen in different images in the sequence. One of the main ideas in the reported work is the use of a spatiotemporal certainty volume c(x) with the same dimension as the input spatiotemporal volume s(x), which is then used as a 'blackboard' for rejecting already segmented image structures. The segmentation starts with searching for image structures in the foreground, eliminates their occluding influence, and then proceeds. Normalized convolution, which is a Weighted Least Mean Square technique for filtering data with varying spatial reliability, is used for all filtering. High spatial resolution near object borders is achieved and only neighboring structures with similar depth support each other.
@techreport{diva2:288324,
author = {Ulvklo, Morgan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Adaptive Reconstruction using Multiple Views}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1998},
type = {Other academic},
number = {LiTH-ISY-R, 2036},
address = {Sweden},
}
This paper presents our general strategy for designing learning machines as well as a number of particular designs. The search for methods allowing a sufficient level of adaptivity is based on two main principles: 1. Simple adaptive local models and 2. Adaptive model distribution. Particularly important concepts in our work are mutual information and canonical correlation. Examples are given of learning feature descriptors, modeling disparity, synthesis of a global 3-mode model and a setup for reinforcement learning of online video coder parameter control.
@techreport{diva2:288299,
author = {Knutsson, Hans and Borga, Magnus and Landelius, Tomas},
title = {{Learning Multidimensional Signal Processing}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1998},
type = {Other academic},
number = {LiTH-ISY-R, 2039},
address = {Sweden},
}
A recursive method to condense general multidimensional FIR-filters into a sequence of simple kernels with mainly one dimensional extent has been worked out. Convolver networks adapted for 2, 3 and 4D signals are presented and the performance is illustrated for spherically separable quadrature filters. The resulting filter responses are mapped to a non-biased tensor representation where the local tensor constitutes a robust estimate of both the shape and the orientation (velocity) of the neighbourhood. A qualitative evaluation of this General Sequential Filter concept results in no detectable loss in accuracy when compared to conventional FIR (Finite Impulse Response) filters, but the computational complexity is reduced by several orders of magnitude. For the examples presented in this paper the attained speed-up is 5, 25 and 300 times for 2D, 3D and 4D data respectively. The magnitude of the attained speed-up implies that complex spatio-temporal analysis can be performed using standard hardware, such as a powerful workstation, in close to real time. Due to the soft implementation of the convolver and the tree structure of the sequential filtering approach the processing is simple to reconfigure for the outer as well as the inner (vector length) dimensionality of the signal. The implementation was made in AVS (Application Visualization System) using modules written in C.
@techreport{diva2:288295,
author = {Andersson, Mats and Wiklund, Johan and Knutsson, Hans},
title = {{Sequential Filter Trees for Efficient 2D 3D and 4D Orientation Estimation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1998},
type = {Other academic},
number = {LiTH-ISY-R, 2070},
address = {Sweden},
}
This paper presents a novel algorithm for analysis of stochastic processes. The algorithm can be used to find the required solutions in the cases of principal component analysis (PCA), partial least squares (PLS), canonical correlation analysis (CCA) or multiple linear regression (MLR). The algorithm is iterative and sequential in its structure and uses on-line stochastic approximation to reach an equilibrium point. A quotient between two quadratic forms is used as an energy function and it is shown that the equilibrium points constitute solutions to the generalized eigenproblem.
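The on-line stochastic approximation itself is not reproduced here, but the following sketch shows the batch form of the same generalized eigenproblem for the CCA special case, built from sample covariances; the data, dimensions, and block structure are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Two views sharing a common latent signal -> one strong canonical correlation.
z = rng.standard_normal(2000)
X = np.column_stack([z + 0.5 * rng.standard_normal(2000),
                     rng.standard_normal(2000)])
Y = np.column_stack([rng.standard_normal(2000),
                     -z + 0.5 * rng.standard_normal(2000)])

X -= X.mean(axis=0); Y -= Y.mean(axis=0)
Cxx, Cyy = X.T @ X / len(X), Y.T @ Y / len(Y)
Cxy = X.T @ Y / len(X)

# CCA as a generalized eigenproblem  A w = rho * B w.
A = np.block([[np.zeros((2, 2)), Cxy], [Cxy.T, np.zeros((2, 2))]])
B = np.block([[Cxx, np.zeros((2, 2))], [np.zeros((2, 2)), Cyy]])
rho, W = eigh(A, B)
print("largest canonical correlation:", rho[-1])   # about 0.8 for this construction
```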
@techreport{diva2:288565,
author = {Borga, Magnus and Landelius, Tomas and Knutsson, Hans},
title = {{A Unified Approach to PCA, PLS, MLR and CCA}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1997},
type = {Other academic},
number = {LiTH-ISY-R, 1992},
address = {Sweden},
}
@techreport{diva2:288560,
author = {Karlholm, Jörgen},
title = {{Tracking of occluded targets in head-up display sequences}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1997},
type = {Other academic},
number = {LiTH-ISY-R, 1993},
address = {Sweden},
}
@techreport{diva2:288304,
author = {Ulvklo, Morgan and Uppsäll, Magnus},
title = {{Adaptive Reconstruction using Multiple Views - Results and Applications}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1997},
type = {Other academic},
address = {Sweden},
}
This paper reviews an existing algorithm for adaptive control based on explicit criterion maximization (ECM) and presents an extended version suited for reinforcement learning tasks. Furthermore, assumptions under which the algorithm converges to a local maximum of a long-term utility function are given. Such convergence theorems are very rare for reinforcement learning algorithms working with continuous state and action spaces. A number of similar algorithms, previously suggested to the reinforcement learning community, are briefly surveyed in order to give the presented algorithm a place in the field. The relations between the different algorithms are exemplified by checking their consistency on a simple problem of linear quadratic regulation (LQR).
@techreport{diva2:288584,
author = {Landelius, Tomas and Knutsson, Hans},
title = {{Reinforcement Learning Adaptive Control and Explicit Criterion Maximization}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1829},
address = {Sweden},
}
A number of success stories have been told where reinforcement learning has been applied to problems in continuous state spaces using neural nets or other sorts of function approximators in the adaptive critics. However, the theoretical understanding of why and when these algorithms work is inadequate. This is clearly exemplified by the lack of convergence results for a number of important situations. To our knowledge only two such results have been presented for systems in the continuous state space domain. The first is due to Werbos and is concerned with linear function approximation and heuristic dynamic programming. Here no optimal strategy can be found, which is why the result is of limited importance. The second result is due to Bradtke and deals with linear quadratic systems and quadratic function approximators. Bradtke's proof is limited to ADHDP and policy iteration techniques where the optimal solution is found by a number of successive approximations. This paper deals with greedy techniques, where the optimal solution is directly aimed for. Convergence proofs for a number of adaptive critics, HDP, DHP, ADHDP and ADDHP, are presented. Optimal controllers for linear quadratic regulation (LQR) systems can be found by standard techniques from control theory, but the assumptions made in control theory can be weakened if adaptive critic techniques are employed. The main point of this paper is, however, not to emphasize the differences but to highlight the similarities and by so doing contribute to a theoretical understanding of adaptive critics.
@techreport{diva2:288542,
author = {Landelius, Tomas and Knutsson, Hans},
title = {{Greedy adaptive critics for LPQ [i.e. LQR] problems:
Convergence Proofs}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1896},
address = {Sweden},
}
This paper presents a novel algorithm for finding the solution of the generalized eigenproblem where the matrices involved contain expectation values from stochastic processes. The algorithm is iterative and sequential in its structure and uses on-line stochastic approximation to reach an equilibrium point. A quotient between two quadratic forms is suggested as an energy function for this problem and is shown to have zero gradient only at the points solving the eigenproblem. Furthermore, it is shown that the algorithm for the generalized eigenproblem can be used to solve three important problems as special cases. For a stochastic process the algorithm can be used to find the directions for maximal variance, covariance, and canonical correlation as well as their magnitudes.
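Only as a loose illustration of the on-line idea (the actual update rules and the convergence analysis are in the report), the toy sketch below ascends the quotient w^T A w / w^T B w using per-sample estimates of A w; the step size, the metric B and the data stream are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(2)
# Toy stream: find the direction of maximal variance of x relative to a metric B.
T = rng.standard_normal((4, 4))
B = T @ T.T + 4 * np.eye(4)                      # fixed positive definite metric
samples = rng.standard_normal((20000, 4)) * np.array([3.0, 1.0, 0.5, 0.2])

w = rng.standard_normal(4)
eta = 1e-3
for x in samples:
    Aw = x * (x @ w)                             # sample estimate of A w, A = E[x x^T]
    r = (w @ Aw) / (w @ B @ w)                   # current value of the quotient
    w += eta * (Aw - r * (B @ w))                # ascend the quotient r(w)
    w /= np.linalg.norm(w)

# Compare against the dominant generalized eigenvector computed in batch.
A = samples.T @ samples / len(samples)
vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
v = np.real(vecs[:, np.argmax(np.real(vals))])
print(abs(w @ v) / np.linalg.norm(v))            # close to 1 if the directions agree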
@techreport{diva2:288332,
author = {Knutsson, Hans and Borga, Magnus and Landelius, Tomas},
title = {{Generalized Eigenproblem for Stochastic Process Covariances}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1916},
address = {Sweden},
}
A scheme for performing generalized convolutions is presented. A flexible convolver, which runs on standard workstations, has been implemented. It is designed for maximum throughput and flexibility. The implementation incorporates spatio-temporal convolutions with configurable vector combinations. It can handle general multilinear operations, i.e. tensor operations on multidimensional data of any order. The input data and the kernel coefficients can be of arbitrary vector length. The convolver is configurable for IIR filters in the time dimension. Other features of the implemented convolver are scattered kernel data, region of interest and subsampling. The implementation is done as a C-library and a graphical user interface in AVS (Application Visualization System).
@techreport{diva2:288320,
author = {Wiklund, Johan and Knutsson, Hans},
title = {{A Generalized Convolver}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1830},
address = {Sweden},
}
This report documents work done at the request of the Swedish Defense Research Establishment. The studied problem is that of detecting point-shaped targets, i.e. targets whose only significant property is that of being very small, in a cluttered environment. Three approaches to the problem have been considered. The first one, based on motion compensation, was rejected at an early stage due to expected problems with robustness and computational demands. The second method, based on background modeling with principal components, turned out successful and has been studied in depth, including discussion of various extensions and improvements of the presented algorithm. Finally, a Wiener filter approach has also turned out successful, including an approximation with separable filters. The methods have been tested on sequences obtained by an IR sensor. While both of the latter approaches work well on the test sequences, the Wiener filter is simpler and computationally less expensive than the background modeling. On the other hand, the background modeling is likely to have better possibilities for extensions and improvements.
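A minimal sketch of the background-modelling idea, assuming synthetic clutter frames and an arbitrarily chosen PCA subspace dimension (the report's algorithm and its extensions differ in detail): a low-dimensional background basis is fitted, and pixels with a large reconstruction residual are flagged as potential point targets.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# 100 background frames of 32x32 pixels with slowly varying clutter plus noise.
t = np.linspace(0, 1, 100)[:, None]
clutter = np.sin(2 * np.pi * (t + np.linspace(0, 1, 32 * 32)[None, :]))
frames = clutter + 0.1 * rng.standard_normal((100, 32 * 32))

pca = PCA(n_components=5).fit(frames)            # background subspace

test = frames[-1].copy()
test[5 * 32 + 20] += 3.0                         # inject a small point target
residual = test - pca.inverse_transform(pca.transform(test[None]))[0]
print(np.unravel_index(np.argmax(np.abs(residual)), (32, 32)))   # ~ (5, 20)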
@techreport{diva2:288286,
author = {Farnebäck, Gunnar and Knutsson, Hans and Granlund, Gösta},
title = {{Detection of point-shaped targets}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1921},
address = {Sweden},
}
Two new reinforcement learning algorithms are presented. Both use a binary tree to store simple local models in the leaf nodes and coarser global models towards the root. It is demonstrated that a meaningful partitioning into local models can only be accomplished in a fused space consisting of both input and output. The first algorithm uses a batch-like statistical procedure to estimate the reward functions in the fused space. The second one uses channel coding to represent the output and input vectors, allowing a simple iterative algorithm based on competing subsystems. The behaviors of both algorithms are illustrated in a preliminary experiment.
@techreport{diva2:288282,
author = {Landelius, Tomas and Borga, Magnus and Knutsson, Hans},
title = {{Reinforcement Learning Trees}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1996},
type = {Other academic},
number = {LiTH-ISY-R, 1828},
address = {Sweden},
}
@techreport{diva2:288633,
author = {Wilson, Roland and Knutsson, Hans},
title = {{Seeing Things II}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1995},
type = {Other academic},
number = {LiTH-ISY-R, 1787},
address = {Sweden},
}
This paper presents an algorithm for estimation of local curvature from gradients of a tensor field that represents local orientation. The algorithm is based on an operator representation of the orientation tensor, which means that a change of local orientation corresponds to a rotation of the eigenvectors of the tensor. The resulting curvature descriptor is a vector that points in the direction in the image in which the local orientation rotates anti-clockwise, and the norm of the vector is the inverse of the radius of curvature. Two coefficients are defined that relate the change of local orientation to either curves or radial patterns.
@techreport{diva2:288599,
author = {Nordberg, Klas and Knutsson, Hans and Granlund, Gösta},
title = {{Local Curvature from Gradients of the Orientation Tensor Field}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1995},
type = {Other academic},
number = {LiTH-ISY-R, 1783},
address = {Sweden},
}
This paper presents a novel learning algorithm that finds the linear combination of one set of multi-dimensional variates that is the best predictor, and at the same time finds the linear combination of another set which is the most predictable. This relation is known as the canonical correlation and has the property of being invariant with respect to affine transformations of the two sets of variates. The algorithm successively finds all the canonical correlations beginning with the largest one. It is shown that canonical correlations can be used in computer vision to find feature detectors by giving examples of the desired features. When used on the pixel level, the method finds quadrature filters and when used on a higher level, the method finds combinations of filter output that are less sensitive to noise compared to vector averaging.
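As a usage-level aside (the report derives its own iterative learning rule rather than a batch solver), scikit-learn's CCA computes the same maximally correlated linear combinations of two sets of variates; the synthetic two-view data below is purely illustrative.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
latent = rng.standard_normal((500, 2))                 # shared structure
X = latent @ rng.standard_normal((2, 8)) + 0.3 * rng.standard_normal((500, 8))
Y = latent @ rng.standard_normal((2, 6)) + 0.3 * rng.standard_normal((500, 6))

cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)                             # canonical variates
for i in range(2):
    print(np.corrcoef(U[:, i], V[:, i])[0, 1])         # canonical correlations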
@techreport{diva2:288567,
author = {Knutsson, Hans and Borga, Magnus and Landelius, Tomas},
title = {{Learning Canonical Correlations}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1995},
type = {Other academic},
number = {LiTH-ISY-R, 1761},
address = {Sweden},
}
This paper presents novel algorithms for finding the singular value decomposition (SVD) of a general covariance matrix by stochastic approximation. General in the sense that also non-square, between sets, covariance matrices are dealt with. For one of the algorithms, convergence is shown using results from stochastic approximation theory. Proofs of this sort, establishing both the point of equilibrium and its domain of attraction, have been reported very rarely for stochastic, iterative feature extraction algorithms.
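A loose sketch, not the algorithms whose convergence is proved in the report: a stochastic-approximation update that tracks the leading singular pair of a between-sets covariance E[x y^T] from a stream of sample pairs. The coupling matrix, noise level and step size are assumptions.

import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 4))                      # underlying coupling: E[x y^T] = M
u = rng.standard_normal(6); u /= np.linalg.norm(u)
v = rng.standard_normal(4); v /= np.linalg.norm(v)

eta = 1e-2
for _ in range(20000):
    y = rng.standard_normal(4)
    x = M @ y + 0.1 * rng.standard_normal(6)         # sample pair with cross-covariance M
    u += eta * x * (y @ v); u /= np.linalg.norm(u)   # stochastic power-type updates
    v += eta * y * (x @ u); v /= np.linalg.norm(v)   # toward the leading singular pair

U, s, Vt = np.linalg.svd(M)
print(abs(u @ U[:, 0]), abs(v @ Vt[0]))              # both should be close to 1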
@techreport{diva2:288273,
author = {Landelius, Tomas and Knutsson, Hans and Borga, Magnus},
title = {{On-Line Singular Value Decomposition of Stochastic Process Covariances}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1995},
type = {Other academic},
number = {LiTH-ISY-R, 1762},
address = {Sweden},
}
We apply the 3D-orientation tensor representation to construct an object tracking algorithm. 2D-line normal flow is estimated by computing the eigenvector associated with the largest eigenvalue of 3D (two spatial dimensions plus time) tensors with a planar structure. The object's true 2D velocity is computed by averaging tensors with consistent normal flows, generating a 3D line representation that corresponds to a 2D point in motion. Flow induced by camera rotation is compensated for by ignoring points with velocity consistent with the ego-rotation. A region-of-interest growing process based on motion consistency generates estimates of object size and position.
@techreport{diva2:288608,
author = {Karlholm, Jörgen and Westelius, Carl-Johan and Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Object Tracking Based on the Orientation Tensor Concept}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1994},
type = {Other academic},
number = {LiTH-ISY-R, 1658},
address = {Sweden},
}
A robust, fast and general method for estimation of object properties is proposed. It is based on a representation of these properties in terms of channels. Each channel represents a particular value of a property, resembling the activity of biological neurons. Furthermore, each processing unit, corresponding to an artificial neuron, is a linear perceptron which operates on outer products of input data. This implies a more complex space of invariances than in the case of first-order characteristics, without abandoning linear theory. In general, the specific function of each processing unit has to be learned, and a fast and simple learning rule is presented. The channel representation, the processing structure and the learning rule have been tested on stereo image data showing a cube with various 3D positions and orientations. The system was able to learn a channel representation for the horizontal position, the depth, and the orientation of the cube, each property invariant to the other two.
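To make the channel idea concrete, here is a minimal sketch of one common channel encoding, truncated cos^2 kernels at regularly spaced positions (the report does not prescribe this exact kernel); a scalar property value becomes a vector of smooth, localized channel activations.

import numpy as np

def channel_encode(value, centers, width=1.0):
    """Encode a scalar as cos^2 channel activations around regularly spaced centers."""
    d = np.abs(value - centers)
    act = np.cos(np.pi * d / (2 * width)) ** 2
    act[d >= width] = 0.0                       # each channel has compact support
    return act

centers = np.arange(0.0, 10.0, 1.0)             # channels at 0, 1, ..., 9
print(channel_encode(3.3, centers, width=1.5))  # a few overlapping non-zero channels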
@techreport{diva2:288329,
author = {Nordberg, Klas and Granlund, Gösta and Knutsson, Hans},
title = {{Representation and Learning of Invariance}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1994},
type = {Other academic},
number = {LiTH-ISY-R, 1552},
address = {Sweden},
}
@techreport{diva2:288308,
author = {Westin, Carl-Fredrik and Westelius, Carl-Johan and Wiklund, Johan and Knutsson, Hans and Granlund, Gösta},
title = {{ESPRIT Basic Research Action 7108, Vision as Process, DR.B.2:
Integration of Multi-level Control Loops and FOA}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1994},
type = {Other academic},
number = {},
address = {Sweden},
}
A robust, general and computationally simple reinforcement learning system is presented. It uses a channel representation which is robust and continuous. The accumulated knowledge is represented as a reward prediction function in the outer product space of the input and output channel vectors. Each computational unit generates an output simply by a vector-matrix multiplication and the response can therefore be calculated fast. The response and a prediction of the reward are calculated simultaneously by the same system, which makes TD-methods easy to implement if needed. Several units can cooperate to solve more complicated problems. A dynamic tree structure of linear units is grown in order to divide the knowledge space into a sufficient number of regions in which the reward function can be properly described. The tree continuously tests split and prune criteria in order to adapt its size to the complexity of the problem.
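A tiny sketch of the core computation under assumed shapes: knowledge stored as a matrix W in the outer-product space of input and output channel vectors, so that the response and the reward prediction are both obtained by vector-matrix multiplications. The delta-rule update shown is illustrative and not necessarily the rule used in the report.

import numpy as np

rng = np.random.default_rng(6)
n_in, n_out = 8, 5
W = np.zeros((n_in, n_out))                   # reward prediction in outer-product space

def predict_reward(a, b):
    return a @ W @ b                          # bilinear form over the channel vectors

def choose_response(a):
    q = a @ W                                 # predicted reward for each output channel
    return np.eye(n_out)[np.argmax(q)]        # respond with the most promising channel

# One illustrative experience: input channels, chosen response, observed reward.
a = rng.random(n_in)
b = choose_response(a)
r = 1.0
W += 0.1 * (r - predict_reward(a, b)) * np.outer(a, b)   # simple delta-rule update
print(predict_reward(a, b))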
@techreport{diva2:288288,
author = {Borga, Magnus and Knutsson, Hans},
title = {{A Binary Competition Tree for Reinforcement Learning}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1994},
type = {Other academic},
number = {LiTH-ISY-R, 1623},
address = {Sweden},
}
This paper addresses the idea of learning by reinforcement, within the theory of behaviorism. The reason for this choice is its generality and especially that the reinforcement learning paradigm allows systems to be designed, which can improve their behavior beyond that of their teacher. The role of the teacher is to define the reinforcement function, which acts as a description of the problem the machine is to solve. Gained knowledge is represented by a behavior probability density function which is approximated with a number of normal distributions, stored in the nodes of a binary tree. It is argued that a meaningful partitioning into local models can only be accomplished in a fused space consisting of both stimuli and responses. Given a stimulus, the system searches for responses likely to result in highly reinforced decisions by treating the sum of the two normal distributions on each level in the tree as a distribution describing the system's behavior at that resolution. The resolution of the response, as well as the tree growing and pruning processes, are controlled by a random variable based on the difference in performance between two consecutive levels in the tree. This results in a system that will never be content but will indefinitely continue to search for better solutions.
@techreport{diva2:288270,
author = {Landelius, Tomas and Knutsson, Hans},
title = {{A Dynamic Tree Structure for Incremental Reinforcement Learning of Good Behavior}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1994},
type = {Other academic},
number = {LiTH-ISY-R, 1628},
address = {Sweden},
}
The tensor representation has proven a successful tool as a means to describe local multi-dimensional orientation. In this respect, the tensor representation is a map from the local orientation to a second order tensor. This paper investigates how variations of the orientation are mapped to variations of the tensor, thereby giving an explicit equivariance relation. The results may be used in order to design tensor based algorithms for extraction of image features defined in terms of local variations of the orientation, e.g. multi-dimensional curvature or circular symmetries. It is assumed that the variation of the local orientation can be described in terms of an orthogonal transformation group. Under this assumption a corresponding orthogonal transformation group, acting on the tensor, is constructed. Several correspondences between the two groups are demonstrated.
@techreport{diva2:288623,
author = {Nordberg, Klas and Knutsson, Hans and Granlund, Gösta},
title = {{On the Equivariance of the Orientation and the Tensor Field Representation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1530},
address = {Sweden},
}
@techreport{diva2:288594,
author = {Larsen, Rasmus},
title = {{Thoughts on Bayesian Estimation of Motion Vector Fields}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1521},
address = {Sweden},
}
@techreport{diva2:288587,
author = {Granlund, Gösta},
title = {{ESPRIT Project BRA 3038: Vision as Process, Final Report}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1473},
address = {Sweden},
}
@techreport{diva2:288577,
author = {Westin, Carl-Fredrik and Westelius, Carl-Johan},
title = {{ESPRIT Basic Research Action 7108, Vision as Process, DR.B.1: Integration of Low-level FOA \& Control Mechanisms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288569,
author = {Granum, Erik and others},
title = {{ESPRIT Basic Research Action 7108, Vision as Process, Periodic progress report}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288563,
author = {Wilson, Roland and Knutsson, Hans},
title = {{Seeing Things [1]}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1467},
address = {Sweden},
}
Three-dimensional data processing is becoming more and more common. Typical operations are for example estimation of optical flow in video sequences and orientation estimation in 3-D MR images. This paper proposes an efficient approach to robust low level feature extraction for 3-D image analysis. In contrast to many earlier algorithms the methods proposed in this paper support the use of relatively complex models at the initial processing steps. The aim of this approach is to provide the means to handle complex events at the initial processing steps and to enable reliable estimates in the presence of noise. A limited basis filter set is proposed which forms a basis on the unit sphere and is related to spherical harmonics. From these basis filters, different types of orientation selective filters are synthesized. An interpolation scheme that provides a rotation as well as a translation of the synthesized filter is presented. The purpose is to obtain a robust and invariant feature extraction at a manageable computational cost.
@techreport{diva2:288342,
author = {Andersson, Mats T. and Knutsson, Hans},
title = {{Controllable 3-D Filters for Low Level Computer Vision}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1526},
address = {Sweden},
}
@techreport{diva2:288290,
author = {Wiklund, Johan and Westin, Carl-Fredrik and Westelius, Carl-Johan},
title = {{AVS, Application Visualization System, Software Evaluation Report}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1993},
type = {Other academic},
number = {LiTH-ISY-R, 1469},
address = {Sweden},
}
@techreport{diva2:288624,
author = {Bårman, Håkan and Granlund, Gösta},
title = {{Hierarchical Feature Extraction for Computer-Aided Analysis of Mammograms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-R, 1448},
address = {Sweden},
}
@techreport{diva2:288561,
author = {Bårman, Håkan and Knutsson, Hans and Granlund, Gösta H.},
title = {{A Note on Estimation of Optical Flow and Acceleration}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-I, 1313},
address = {Sweden},
}
@techreport{diva2:288339,
author = {Wiklund, Johan and Westelius, Carl-Johan and Knutsson, Hans},
title = {{Hierarchical Phase Based Disparity Estimation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-I, 1327},
address = {Sweden},
}
This survey considers response generating systems that improve their behaviour using reinforcement learning. The difference between unsupervised learning, supervised learning, and reinforcement learning is described. Two general problems concerning learning systems are presented; the credit assignment problem and the problem of perceptual aliasing. Notations and some general issues concerning reinforcement learning systems are presented. Reinforcement learning systems are further divided into two main classes; memory mapping and projective mapping systems. Each of these classes is described and some examples are presented. Some other approaches are mentioned that do not fit into the two main classes. Finally some issues not covered by the surveyed articles are discussed, and some comments on the subject are made.
@techreport{diva2:288303,
author = {Borga, Magnus and Carlsson, Tomas},
title = {{A Survey of Current Techniques for Reinforcement Learning}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-I, 1391},
address = {Sweden},
}
@techreport{diva2:288294,
author = {Westin, Carl-Fredrik},
title = {{ESPRIT Basic Research Action 3038, Vision as Process, DR.A.2.1: Model Support and Local FOA Control}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {},
address = {Sweden},
}
The topic of this report is signal representation in the context of hierarchical image processing. An overview of hierarchical processing systems is included as well as a presentation of various approaches to signal representation, feature representation and feature extraction. It is claimed that image hierarchies based on feature extraction, so-called feature hierarchies, demand a signal representation other than the standard spatial or linear representation used today. A new representation, the operator representation, is developed. It is based on an interpretation of features in terms of signal transformations. This representation has no references to any spatial ordering of the signal elements and also gives an explicit representation of signal features. Using the operator representation, a generalization of the standard phase concept in image processing is introduced. Based on the operator representation, two algorithms for extraction of feature values are presented. Both have the capability of generating phase invariant feature descriptors. It is claimed that the operator representation in conjunction with some appropriate feature extraction algorithm is well suited as a general framework for defining multi level feature hierarchies. The report contains an appendix with the mathematical details necessary to comprehend the presentation.
@techreport{diva2:288284,
author = {Nordberg, Klas},
title = {{Signal Representation and Signal Processing using Operators}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-I, 1387},
address = {Sweden},
}
@techreport{diva2:288264,
author = {Westelius, Carl-Johan and Knutsson, Hans and Wiklund, Johan},
title = {{Robust Vergence Control Using Scale--Space Phase Information}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {LiTH-ISY-I, 1363},
address = {Sweden},
}
@techreport{diva2:288262,
author = {Westelius, Carl-Johan},
title = {{ESPRIT Basic Research Action 3038, Vision as Process, DS.A.2.1: Software for Model Support and Local FOA Control}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1992},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288626,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{ESPRIT Basic Research Action 3038, Vision as Process, DR.A.1.2: Definition of feature generating procedures}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288589,
author = {Wiklund, Johan and Knutsson, Hans and Wilson, Roland},
title = {{A Hierarchical Stereo Algorithm}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {LiTH-ISY-I, 1167},
address = {Sweden},
}
@techreport{diva2:288547,
author = {Bårman, Håkan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Using Principal Direction Estimates for Shape and Acceleration Description}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {LiTH-ISY-I, 1231},
address = {Sweden},
}
@techreport{diva2:288341,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{Line Segmentation by Clustering in Möbius-Hough Space}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {LiTH-ISY-I, 1221},
address = {Sweden},
}
@techreport{diva2:288333,
author = {Westelius, Carl-Johan and Granlund, Gösta},
title = {{Integrated Analysis-Control Structure for Robotic Systems}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288298,
author = {Westelius, Carl-Johan and Knutsson, Hans},
title = {{ESPRIT Basic Research Action 3038, Vision as Process, DS.A.1.1: Preliminary Software for Feature Extraction}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {},
address = {Sweden},
}
@techreport{diva2:288292,
author = {Wilson, Roland and Calway, Andrew and Pearson, Edward R. S.},
title = {{A generalised wavelet transform for Fourier analysis: The multiresolution Fourier transform and its application to image and audio signal analysis}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {LiTH-ISY-I, 1177},
address = {Sweden},
}
The problem of incorporating orientation selectivity into transforms which provide local frequency representation of image regions over a range of spatial scales is investigated. It is shown that this can be achieved if the local spectra are defined on a log-polar coordinate lattice and that by appropriate choice of window functions, the spectra can be designed to be steerable in arbitrary orientations. In addition, the resulting class of transforms can be defined to be invertible, be based on window functions having good localization in both the spatial and spatial frequency domains, and be efficiently implemented using FFT techniques. Results of using one such transform for linear feature extraction demonstrate its effectiveness when dealing with oriented features.
@techreport{diva2:288269,
author = {Calway, Andrew},
title = {{Incorporating Orientation Selectivity in Wavelet Transforms: For Multi--Resolution Fourier Analysis of Images}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1991},
type = {Other academic},
number = {LiTH-ISY-I, 1243},
address = {Sweden},
}
@techreport{diva2:288325,
author = {Westelius, Carl-Johan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Focus of Attention Control}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1990},
type = {Other academic},
number = {LiTH-ISY-I, 1140},
address = {Sweden},
}
@techreport{diva2:288319,
author = {Westin, Carl-Fredrik and Knutsson, Hans},
title = {{A Parameter Mapping for Line Segmentation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1990},
type = {Other academic},
number = {LiTH-ISY-I, 1151},
address = {Sweden},
}
@techreport{diva2:288293,
author = {Bårman, Håkan and Granlund, Gösta H. and Knutsson, Hans},
title = {{Hierarchical Curvature Estimation and Description}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1990},
type = {Other academic},
number = {LiTH-ISY-I, 1095},
address = {Sweden},
}
@techreport{diva2:288609,
author = {Bårman, Håkan and Knutsson, Hans and Granlund, Gösta H.},
title = {{Mechanisms for Striate Cortex Organization}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 1020},
address = {Sweden},
}
@techreport{diva2:288606,
author = {Westin, Carl-Fredrik and Westelius, Carl-Johan},
title = {{Brain chaos. A feature or a bug?}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 0990},
address = {Sweden},
}
This report is a survey of information representations in both biological and artificial neural networks. The correct information representation is crucial for the dynamics and the adaptation algorithms of neural networks. A number of examples of existing information representations are given.
@techreport{diva2:288541,
author = {Järvinen, Arto},
title = {{Information representation in neural networks -- A survey}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 0994},
address = {Sweden},
}
@techreport{diva2:288328,
author = {Granlund, Gösta H.},
title = {{Image Processing Systems and Components}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 1016},
address = {Sweden},
}
@techreport{diva2:288321,
author = {Granlund, Gösta H.},
title = {{Information Representation in Image Analysis Algorithms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 1017},
address = {Sweden},
}
@techreport{diva2:288313,
author = {Järvinen, Arto and Wiklund, Johan},
title = {{Study of information mapping in Kohonen--Networks}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 0978},
address = {Sweden},
}
@techreport{diva2:288296,
author = {Granlund, Gösta H.},
title = {{Discriminant Functions, Linear Operations and Learning}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1989},
type = {Other academic},
number = {LiTH-ISY-I, 1015},
address = {Sweden},
}
@techreport{diva2:288646,
author = {Granlund, Gösta H.},
title = {{Integrated Analysis-Response Structures for Robotics Systems}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0932},
address = {Sweden},
}
@techreport{diva2:288640,
author = {Granlund, Gösta H.},
title = {{Magnitude Representation of Feature Variables}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0933},
address = {Sweden},
}
@techreport{diva2:288600,
author = {Bårman, Håkan and Haglund, Leif and Granlund, Gösta H.},
title = {{Context Dependent Hierarchical Image Processing for Remote Sensing Data, Part Two: Contextual Classification and Segmentation}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0924},
address = {Sweden},
}
@techreport{diva2:288338,
author = {Granlund, Gösta H. and Knutsson, Hans},
title = {{Compact Associative Representation of Structural Information}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0931},
address = {Sweden},
}
@techreport{diva2:288336,
author = {Bigun, Josef},
title = {{Impressions from Picture Processing in USA and Japan}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0892},
address = {Sweden},
}
@techreport{diva2:288334,
author = {Andersson, Mats and Granlund, Gösta H.},
title = {{A Hybrid Image Processing Architecture}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0929},
address = {Sweden},
}
The symmetries in a neighbourhood of a gray value image are modelled by conjugate harmonic function pairs. A harmonic function pair is utilized to represent a coordinate transformation defining a symmetry type. In this coordinate representation the image parts, which are symmetric with respect to the chosen function pair, have iso-gray value curves which are simple lines or parallel line patterns. The detection is modelled in the special Fourier domain corresponding to the new variables by minimizing an error function. It is shown that the minimization process, or detection of these patterns, can be carried out for the whole image entirely in the spatial domain by convolutions. The convolution kernel is complex valued, as is the result. The magnitudes of the result are shown to correspond to a well defined certainty measure, while the orientation is the least square estimate of an orientation in the Fourier transform corresponding to the harmonic coordinates. Applications to four symmetries are given. These are circular, linear, hyperbolic and parabolic symmetries. Experimental results are presented.
@techreport{diva2:288323,
author = {Bigun, Josef},
title = {{Detection of Linear Symmetry in Multiple Dimensions for Description of Local Orientation and Optical Flow}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 893},
address = {Sweden},
}
@techreport{diva2:288287,
author = {Granlund, Gösta H.},
title = {{Bi-Directionally Adaptive Models in Image Analysis}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1988},
type = {Other academic},
number = {LiTH-ISY-I, 0930},
address = {Sweden},
}
The problem of optimal detection of orientation in arbitrary neighborhoods is solved in the least squares sense. It is shown that this corresponds to fitting an axis in the Fourier domain of the n-dimensional neighborhood, the solution of which is a well known solution of a matrix eigenvalue problem. The eigenvalues are the variance or inertia with respect to the axes given by their respective eigenvectors. The orientation is taken as the axis given by the least eigenvalue. Moreover, it is shown that the necessary computations can be pursued in the spatial domain without doing a Fourier transformation. An implementation for 2-D is presented. Two certainty measures are given corresponding to the orientation estimate. These are the relative or the absolute distances between the two eigenvalues, revealing whether the fitted axis is much better than an axis orthogonal to it. The result of the implementation is verified by experiments which confirm an accurate orientation estimation and a reliable certainty measure in the presence of additive noise at high as well as low levels.
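The construction described above, least-squares orientation from an eigenvalue problem evaluated entirely in the spatial domain, is essentially what is now called the structure tensor. The 2-D sketch below is a hedged modern rendering with assumed gradient operators and smoothing scale, not the original implementation; it also reports the relative eigenvalue distance as a certainty measure.

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def orientation_2d(image, sigma=2.0):
    gx = sobel(image, axis=1, mode="nearest")
    gy = sobel(image, axis=0, mode="nearest")
    # Local averages of the outer product of the gradient (the inertia tensor).
    txx = gaussian_filter(gx * gx, sigma)
    tyy = gaussian_filter(gy * gy, sigma)
    txy = gaussian_filter(gx * gy, sigma)
    # Eigenvalues of [[txx, txy], [txy, tyy]] in closed form.
    tr, det = txx + tyy, txx * tyy - txy ** 2
    disc = np.sqrt(np.maximum((tr / 2) ** 2 - det, 0.0))
    lam1, lam2 = tr / 2 + disc, tr / 2 - disc
    angle = 0.5 * np.arctan2(2 * txy, txx - tyy)       # dominant gradient orientation
    certainty = (lam1 - lam2) / (lam1 + lam2 + 1e-12)  # relative eigenvalue distance
    return angle, certainty

# Oriented test pattern: stripes at 30 degrees plus additive noise.
y, x = np.mgrid[0:128, 0:128]
img = np.sin(0.3 * (np.cos(np.radians(30)) * x + np.sin(np.radians(30)) * y))
ang, cert = orientation_2d(img + 0.2 * np.random.default_rng(7).standard_normal(img.shape))
print(np.degrees(np.median(ang)), np.median(cert))     # ~30 degrees, certainty near 1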
@techreport{diva2:691493,
author = {Bigun, Josef},
title = {{Optimal Orientation Detection of Linear Symmetry}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1987},
type = {Other academic},
number = {LiTH-ISY-I, 828},
address = {Sweden},
}
@techreport{diva2:288607,
author = {Albregtsen, Fritz},
title = {{Enhancing Satellite Images of the Antarctic Snow and Ice Cover by Context Dependent Anisotropic Nonstationary Filtering.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1987},
type = {Other academic},
number = {LiTH-ISY-I, 0852},
address = {Sweden},
}
@techreport{diva2:288274,
author = {Bigun, Josef},
title = {{Optimal Orientation Detection of Circular Symmetry.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1987},
type = {Other academic},
number = {LiTH-ISY-I, 0871},
address = {Sweden},
}
A definition of central symmetry for local neighborhoods of 2-D images is given. A complete ON-set of centrally symmetric basis functions is proposed. The local neighborhoods are expanded in this basis. The behavior of the coefficient spectrum obtained by this expansion is proposed to be the foundation of central symmetry parameters of the neighborhoods. Specifically, examination of two such behaviors is proposed: point concentration and line concentration of the energy spectrum. Moreover, the study of these types of behaviors of the spectrum is shown to be possible to carry out in the spatial domain.
@techreport{diva2:691498,
author = {Bigun, Josef and Granlund, Gösta H.},
title = {{Central Symmetry Modelling}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1986},
type = {Other academic},
number = {LiTH-ISY-I, 789},
address = {Sweden},
}
@techreport{diva2:288617,
author = {Bårman, Håkan and Granlund, Gösta H. and Knutsson, Hans and Näppä, L.},
title = {{Context Dependent Hierarchical Image Processing for Remote Sensing Data.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1986},
type = {Other academic},
number = {LiTH-ISY-I, 0824},
address = {Sweden},
}
@techreport{diva2:288554,
author = {Granlund, Gösta H.},
title = {{Introduction to GOP Computer Vision.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1986},
type = {Other academic},
number = {LiTH-ISY-I, 0849},
address = {Sweden},
}
@techreport{diva2:288310,
author = {Näppä, Lars and Granlund, Gösta H.},
title = {{Texture Analysis and Description.}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1985},
type = {Other academic},
number = {LiTH-ISY-I, 0775},
address = {Sweden},
}
@techreport{diva2:403796,
author = {Granlund, Gösta},
title = {{Images and Computers}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1984},
type = {Other academic},
number = {LiTH-ISY-I, 0701},
address = {Sweden},
}
@techreport{diva2:403809,
author = {Wilson, Roland and Granlund, Gösta},
title = {{The Uncertainty Principle in Image Processing}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0576},
address = {Sweden},
}
@techreport{diva2:403805,
author = {Wilson, Roland},
title = {{Uncertainty, Eigenvalue Problems and Filter Design}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0580},
address = {Sweden},
}
@techreport{diva2:403801,
author = {Wilson, Roland},
title = {{The Uncertainty Principle in Vision}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0581},
address = {Sweden},
}
@techreport{diva2:403800,
author = {Wilson, Roland},
title = {{Quad-Tree Predictive Coding:
A New Class of Image Data Compression Algorithms}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0609},
address = {Sweden},
}
@techreport{diva2:403798,
author = {Wilson, Roland},
title = {{A Class of Local Centroid Algorithms for Classification and Quantization in Spaces of Arbitrary Dimension}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0610},
address = {Sweden},
}
@techreport{diva2:288302,
author = {Wilson, Roland},
title = {{The Uncertainty Principle in Image Coding}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1983},
type = {Other academic},
number = {LiTH-ISY-I, 0579},
address = {Sweden},
}
Operators for extraction of local information are essential components in an image processing system. This paper concentrates on the design and evaluation of convolution kernel sets enabling easy estimation of local orientation and frequency.
Consideration of interpolation properties and the limiting effects of the uncertainty principle leads to the definition of an "ideal" quadrature filter function. An optimization procedure is utilized to produce pairs of convolution kernels which implement an approximation of the desired function. A number of optimization results are presented.
To evaluate the performance of the optimized kernels in an image processing task, a series of experiments has been carried out. Examples are given of local orientation and frequency estimates for images with different signal to noise ratios. An angle deviation measure is defined and a vector averaging scheme is introduced to increase angle estimation accuracy. Using a 0 dB SNR test image, orientation estimates are produced having an expected deviation of less than 7 degrees.
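As a hedged sketch of the kernel-optimization step (the report's "ideal" quadrature function and optimization criterion are more elaborate), the code below finds the small spatial kernel whose DFT best matches an assumed one-sided band-pass frequency response in the least-squares sense; the real and imaginary parts of the result form the pair of kernels.

import numpy as np

N, K = 256, 15                                   # DFT length and spatial kernel size
w = 2 * np.pi * np.fft.fftfreq(N)                # angular frequencies on the DFT grid

# Assumed target: a band-pass, one-sided ("quadrature") response around w0.
w0, bw = 1.0, 0.6
desired = np.where(w > 0, np.exp(-((w - w0) ** 2) / (2 * bw ** 2)), 0.0).astype(complex)

# Columns of F map the K centered kernel taps to the DFT grid.
taps = np.arange(K) - K // 2
F = np.exp(-1j * np.outer(w, taps))
kernel, *_ = np.linalg.lstsq(F, desired, rcond=None)   # complex kernel: even/odd pair
print(np.max(np.abs(F @ kernel - desired)))            # worst-case approximation error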
@techreport{diva2:319074,
author = {Knutsson, Hans},
title = {{Design of Convolution Kernels}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1982},
type = {Other academic},
number = {LiTH-ISY-I, 0557},
address = {Sweden},
}
@techreport{diva2:288540,
author = {Granlund, Gösta H.},
title = {{Hierarchical Distributed Data Structures and Operations}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1982},
type = {Other academic},
number = {LiTH-ISY-I, 0512},
address = {Sweden},
}
@techreport{diva2:288571,
author = {Granlund, Gösta H. and Knutsson, Hans and Hedlund, Martin},
title = {{Hierarchical Processing of Structural Information}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1981},
type = {Other academic},
number = {LiTH-ISY-I, 0481},
address = {Sweden},
}
@techreport{diva2:288309,
author = {Kunt, Murat},
title = {{Picture Coding with the General Operator Processor (GOP)}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1980},
type = {Other academic},
number = {LiTH-ISY-I, 0370},
address = {Sweden},
}
@techreport{diva2:288306,
author = {Knutsson, Hans},
title = {{3-D Reconstruction by Fourier Techniques with Error Estimates}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1978},
type = {Other academic},
number = {LiTH-ISY-I, 0214},
address = {Sweden},
}
@techreport{diva2:288337,
author = {Granlund, Gösta H.},
title = {{Computer Processing and Display of Chromosome Image Information}},
institution = {Linköping University, Department of Electrical Engineering},
year = {1973},
type = {Other academic},
number = {LiTH-ISY-I, 0023},
address = {Sweden},
}
Student theses
While markerless motion capture provided acceptable accuracy, no clear patterns emerged regarding the individual effects of surface properties on technique. This is most likely due to limitations such as sample size, the lack of a standardized data set (players) across facilities, and limited control over player behavior. However, analyzing one individual's motion capture data across surfaces showed potential for distinguishing turning styles based on facility parameters.
The method in this thesis demonstrates the potential of markerless motion capture for injury prevention research in football. Despite inconclusive results on the individual facility parameter effects, the ability to distinguish player styles across surfaces suggests valuable future directions for investigating personalized risk factors and optimizing playing surfaces. Further research with larger, more diverse samples and a broader set of biomechanical and facility features could provide deeper insight into injury prevention strategies.
@mastersthesis{diva2:1848290,
author = {Rommel, Kaspar},
title = {{Influence of artificial turf on football technique using motion capture and 3D modelling}},
school = {Linköping University},
type = {{}},
year = {2024},
address = {Sweden},
}
This thesis investigates the seasonal predictive capabilities of Neural Radiance Fields (NeRF) applied to satellite images. Focusing on the utilization of satellite data, the study explores how Sat-NeRF, a novel approach in computer vision, performs in predicting seasonal variations across different months. Through comprehensive analysis and visualization, the study examines the model's ability to capture and predict seasonal changes, highlighting specific challenges and strengths. Results showcase the impact of the sun on predictions, revealing nuanced details in seasonal transitions, such as snow cover, color accuracy, and texture representation in different landscapes. The research introduces modifications to the Sat-NeRF network. The implemented versions of the network include geometrically rendered shadows, a signed distance function, and a month embedding vector, where the last version mentioned resulted in Planet-NeRF. Comparative evaluations reveal that Planet-NeRF outperforms prior models, particularly in refining seasonal predictions. This advancement contributes to the field by presenting a more effective approach for seasonal representation in satellite imagery analysis, offering promising avenues for future research in this domain.
@mastersthesis{diva2:1841942,
author = {Ingerstad, Erica and Kåreborn, Liv},
title = {{Planet-NeRF:
Neural Radiance Fields for 3D Reconstruction on Satellite Imagery in Season Changing Environments}},
school = {Linköping University},
type = {{LiTH-ISY-EX--24/5631--SE}},
year = {2024},
address = {Sweden},
}
This thesis explores the integration of deep learning-based depth estimation models with the ORB-SLAM3 framework to address challenges in monocular Simultaneous Localization and Mapping (SLAM), particularly focusing on pure rotational movements. The study investigates the viability of using pre-trained generic depth estimation networks, and hybrid combinations of these networks, to replace traditional depth sensors and improve scale accuracy in SLAM systems. A series of experiments are conducted outdoors, utilizing a custom camera setup designed to isolate pure rotational movements. The analysis involves assessing each model's impact on the SLAM process as well as key performance indicators (KPIs) for both depth estimation and 3D tracking. Results indicate a correlation between depth estimation accuracy and SLAM performance, underscoring the potential of depth estimation models in enhancing SLAM systems. The findings contribute to the understanding of the role of monocular depth estimation in integrating with SLAM, especially in applications requiring precise spatial awareness for augmented reality.
@mastersthesis{diva2:1845865,
author = {Bladh, Daniel},
title = {{Deep Learning-Based Depth Estimation Models with Monocular SLAM:
Impacts of Pure Rotational Movements on Scale Drift and Robustness}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5630--SE}},
year = {2023},
address = {Sweden},
}
Automatic 3D reconstruction of birds can aid researchers in studying their behavior. Recently there has been an attempt to reconstruct a variety of birds from single-view images. However, the common murre's appearance is different from the birds that have been studied. Moreover, recent studies have focused on side views. This thesis studies the 3D reconstruction of the common murre from single-view top-view images. A template mesh is first optimized to fit a 3D scan. Then the result is used to optimize a species-specific mean from side-view images annotated with keypoints and silhouettes. The resulting mean mesh is used to initialize the optimization for top-down images. Using a mask loss, a pose prior loss, and a bone length loss that uses a mean vector from the side-view images improves the 3D reconstruction as rated by humans. Furthermore, the intersection over union (IoU) and percentage of correct keypoints (PCK), although used by other authors, are insufficient in a single-view top-view setting.
@mastersthesis{diva2:1779743,
author = {Hägerlind, Johannes},
title = {{3D-Reconstruction of the Common Murre}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5576--SE}},
year = {2023},
address = {Sweden},
}
The goal of this thesis is to use fringe-pattern phase analysis to calibrate the distortion of a camera lens. The benefit of using this method is that the distortion can be calculated using data from each individual pixel and the methodology does not need any model.
The phase used to calibrate the images is calculated in two different ways, either utilizing the monogenic signal or through fringe-pattern phase analysis.
The calibration approaches were also validated through different methods. Primarily by utilizing the Hough transform and calibrating simulated distortion. The thesis also introduces a validation approach utilizing the phase orientation calculated through the monogenic signal.
The thesis also implements different approaches, such as flat field correction, to limit the impact of image sensor noise and thereby mitigate the phase noise.
It is also investigated, through comparative analysis, which fringe-pattern frequencies are best suited for calibration. The comparative analysis identified problems with both too high and too low fringe-pattern frequencies when calibrating using fringe-pattern phase analysis.
@mastersthesis{diva2:1773375,
author = {Karlsson, Karl},
title = {{Camera Distortion Calibration through Fringe Pattern Phase Analysis}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5580--SE}},
year = {2023},
address = {Sweden},
}
In the digital age where video content is abundant, this thesis investigates the efficient adaptation of an existing video-language model (VLM) to new data. The research leverages CLIP, a robust language-vision model, for various video-related tasks including video retrieval. The study explores using pre-trained VLMs to extract video embeddings without the need for extensive retraining. The effectiveness of a smaller model using aggregation is compared with larger models, and the application of logistic regression for few-shot learning on video embeddings is examined. The aggregation was done both without learning, through mean-pooling, and by utilizing a transformer. The video-retrieval models were evaluated on the ActivityNet Captions dataset, which contains long videos with dense descriptions, while the linear probes were evaluated on ActivityNet200, a video classification dataset.
The study's findings suggest that most models improved when additional frames were employed through aggregation. A model trained with fewer frames was able to surpass those trained with two or four times more frames by instead using aggregation. The incorporation of patch dropout and the freezing of embeddings proved advantageous by enhancing performance and conserving training resources. Furthermore, using a linear probe showed that the extracted features were of high quality, requiring only 2-4 samples per class to match the zero-shot performance.
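A minimal sketch of the two simplest ingredients discussed above, mean-pooling frame embeddings into a video embedding and fitting a logistic-regression probe on a few samples per class; it assumes per-frame CLIP embeddings have already been extracted, and the random vectors below are stand-ins for them.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n_videos, n_frames, dim, n_classes = 200, 8, 512, 10

# Stand-ins for per-frame CLIP embeddings of shape (videos, frames, dim).
class_centers = rng.standard_normal((n_classes, dim))
labels = np.repeat(np.arange(n_classes), n_videos // n_classes)
frame_emb = class_centers[labels][:, None, :] + 0.5 * rng.standard_normal(
    (n_videos, n_frames, dim)
)

# Non-learned aggregation: mean-pool the frames, then L2-normalise the video embedding.
video_emb = frame_emb.mean(axis=1)
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)

# Few-shot linear probe: 4 labelled videos per class, evaluate on the rest.
train_idx = np.concatenate([np.where(labels == c)[0][:4] for c in range(n_classes)])
test_idx = np.setdiff1d(np.arange(n_videos), train_idx)
probe = LogisticRegression(max_iter=1000).fit(video_emb[train_idx], labels[train_idx])
print("few-shot accuracy:", probe.score(video_emb[test_idx], labels[test_idx]))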
@mastersthesis{diva2:1772807,
author = {Lindgren, Felix},
title = {{Efficient Utilization of Video Embeddings from Video-Language Models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5592--SE}},
year = {2023},
address = {Sweden},
}
In the field of autonomous driving a common scenario is to apply deep learning models on camera feeds to provide information about the surroundings. A recent trend is for such vision-based methods to be centralized, in that they fuse images from all cameras in one big model for a single comprehensive output. Designing and tuning such models is hard and time consuming, in both development and training. This thesis aims to reproduce the results of a paper about a centralized vision-based model performing 3D object detection, called BEVDet. Additional goals are to ablate the technique of class balanced grouping and sampling used in the model, to tune the model to improve generalization, and to change the detection head of the model to a Transformer decoder-based head.
The findings include a successful reproduction of the results of the paper, while adding depth supervision to BEVDet establishes a baseline for the subsequent experiments. An increasing validation loss during most of the training indicates that there is room for improvement in the generalization of the model. Several different methods are tested in order to resolve the increasing validation loss, but they all fail to do so. The ablation study shows that the class balanced grouping is important for the performance of the chosen configuration of the model, while the class balanced sampling does not contribute significantly. Without extensive tuning the replacement head gives performance similar to the PETR, the model that the head is adapted from, but fails to match the performance of the baseline model. In addition, the model with the Transformer decoder-based head shows a converging validation loss, unlike the baseline model.
@mastersthesis{diva2:1771747,
author = {Lidman, Erik},
title = {{Visual Bird's-Eye View Object Detection for Autonomous Driving}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5579--SE}},
year = {2023},
address = {Sweden},
}
In synthetic aperture radar (SAR) and inverse synthetic aperture radar (ISAR), an imaging radar emits electromagnetic waves of varying frequencies towards a target and the backscattered waves are collected. By either moving the radar antenna or rotating the target and combining the collected waves, a much longer synthetic aperture can be created. These radar measurements can be used to determine the radar cross-section (RCS) of the target and to reconstruct an estimate of the target. However, the reconstructed images will suffer from spectral leakage effects and are limited in resolution. Many methods of enhancing the images exist and some are based on deep learning. Most commonly the deep learning methods rely on high-resolution ground truth data of the scene to train a neural network to enhance the radar images. In this thesis, a method that does not rely on any high-resolution ground truth data is applied to train a convolutional neural network to enhance radar images. The network takes a conventional ISAR image subject to spectral leakage effects as input and outputs an enhanced ISAR image which contains much more defined features. New RCS measurements are created from the enhanced ISAR image and the network is trained to minimise the difference between the original RCS measurements and the new RCS measurements. A sparsity constraint is added to ensure that the proposed enhanced ISAR image is sparse. The synthetic training data consists of scenes containing point scatterers that are either individual or grouped together to form shapes. The scenes are used to create synthetic radar measurements which are then used to reconstruct ISAR images of the scenes. The network is tested using both synthetic data and measurement data from a cylinder and two aeroplane models. The network manages to minimise spectral leakage and increase the resolution of the ISAR images created from both synthetic and measured RCSs, especially on measured data from target models which have similar features to the synthetic training data.
The contributions of this thesis work are firstly a convolutional neural network that enhances ISAR images affected by spectral leakage. The neural network handles complex-valued signals as a single channel and does not perform any rescaling of the input. Secondly, it is shown that it is sufficient to calculate the new RCS for much fewer frequency samples and angular positions and compare those measurements to the corresponding frequency samples and angular positions in the original RCS to train the neural network.
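To make the described training objective concrete, the following is a minimal PyTorch-style sketch of such a loss, assuming a differentiable operator forward_op that re-simulates RCS samples from the proposed enhanced image; the operator, tensor shapes and the weight sparsity_weight are illustrative assumptions, not the thesis implementation.

import torch

def isar_training_loss(enhanced_image, measured_rcs, forward_op, sparsity_weight=1e-3):
    """Loss sketch: consistency between the original RCS measurements and new
    measurements re-simulated from the enhanced image, plus an L1 sparsity term."""
    simulated_rcs = forward_op(enhanced_image)             # re-create RCS samples from the enhanced image
    data_term = torch.mean(torch.abs(simulated_rcs - measured_rcs) ** 2)
    sparsity_term = torch.mean(torch.abs(enhanced_image))  # encourages a sparse scene estimate
    return data_term + sparsity_weight * sparsity_term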
@mastersthesis{diva2:1767511,
author = {Enåkander, Moltas},
title = {{ISAR Imaging Enhancement Without High-Resolution Ground Truth}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5572--SE}},
year = {2023},
address = {Sweden},
}
Detecting defects in industrially manufactured products is crucial to ensure their safety and quality. This process can be both expensive and error-prone if done manually, making automated solutions desirable. There is extensive research on industrial anomaly detection in images, but recent studies have shown that adding 3D information can increase the performance. This thesis aims to extend the 2D anomaly detection framework, PaDiM, to incorporate 3D information. The proposed methods combine RGB with depth maps or point clouds and the effects of using PointNet++ and vision transformers to extract features are investigated. The methods are evaluated on the MVTec 3D-AD public dataset using the metrics image AUROC, pixel AUROC and AUPRO, and on a small dataset collected with a Time-of-Flight sensor. This thesis concludes that the addition of 3D information improves the performance of PaDiM and vision transformers achieve the best results, scoring an average image AUROC of 86.2±0.2 on MVTec 3D-AD.
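For reference, the PaDiM baseline that the thesis extends models each patch position of pretrained features with a multivariate Gaussian and scores test patches by Mahalanobis distance; the sketch below illustrates that scoring step under assumed array shapes (the 3D extensions would concatenate depth or point-cloud features into the same embeddings).

import numpy as np

def fit_patch_gaussians(features):
    """features: (N, P, D) embeddings for N normal images, P patch positions, D dims.
    Returns per-position mean and regularized inverse covariance."""
    n, p, d = features.shape
    means = features.mean(axis=0)                                           # (P, D)
    inv_covs = np.empty((p, d, d))
    for i in range(p):
        cov = np.cov(features[:, i, :], rowvar=False) + 0.01 * np.eye(d)    # regularization
        inv_covs[i] = np.linalg.inv(cov)
    return means, inv_covs

def anomaly_scores(test_features, means, inv_covs):
    """Mahalanobis distance of each patch embedding (P, D) to its per-position Gaussian."""
    diffs = test_features - means
    return np.sqrt(np.einsum('pd,pde,pe->p', diffs, inv_covs, diffs))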
@mastersthesis{diva2:1766718,
author = {Bärudde, Kevin and Gandal, Marcus},
title = {{Industrial 3D Anomaly Detection and Localization Using Unsupervised Machine Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5569--SE}},
year = {2023},
address = {Sweden},
}
This thesis explores the application of Contrastive Language-Image Pre-Training (CLIP), a vision-language model, in an automated video surveillance system for anomaly detection. The ability of CLIP to perform zero-shot learning, coupled with its robustness against minor image alterations due to its lack of reliance on pixel-level image analysis, makes it a suitable candidate for this application.
The study investigates the performance of CLIP in tandem with various anomaly detection algorithms within a visual surveillance system. A custom dataset was created for video anomaly detection, encompassing two distinct views and two varying levels of anomaly difficulty. One view offers a more zoomed-in perspective, while the other provides a wider perspective. This was conducted to evaluate the capacity of CLIP to manage objects that occupy either a larger or smaller portion of the entire scene.
Several different anomaly detection methods were tested with varying levels of supervision, including unsupervised, one-class classification, and weakly-supervised algorithms, which were compared against each other. To create better separation between the CLIP embeddings, a metric learning model was trained and then used to transform the CLIP embeddings to a new embedding space.
The study found that CLIP performs effectively when anomalies take up a larger part of the image, such as in the zoomed-in view, where some of the One-Class Classification (OCC) and weakly-supervised methods demonstrated superior performance. When anomalies take up a significantly smaller part of the image in the wider view, CLIP has difficulty distinguishing anomalies from normal scenes, even when using the transformed CLIP embeddings. For the wider view, the results again favoured the OCC and weakly-supervised methods.
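As an illustration of the overall pipeline, the sketch below embeds frames with CLIP and scores them with a simple one-class k-nearest-neighbour rule; normal_frame_paths and test_frame_paths are placeholder names, and kNN distance is only one of several possible OCC scorers, not necessarily the one used in the thesis.

import clip                      # OpenAI CLIP package
import torch
from PIL import Image
from sklearn.neighbors import NearestNeighbors

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed(paths):
    """Return L2-normalised CLIP image embeddings for a list of image paths."""
    with torch.no_grad():
        batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
        feats = model.encode_image(batch)
        return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# Fit on embeddings of normal frames, then score test frames by their mean
# distance to the k nearest normal embeddings (higher = more anomalous).
normal_feats = embed(normal_frame_paths)    # placeholder list of training frame paths
knn = NearestNeighbors(n_neighbors=5).fit(normal_feats)
test_feats = embed(test_frame_paths)        # placeholder list of test frame paths
scores = knn.kneighbors(test_feats)[0].mean(axis=1)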
@mastersthesis{diva2:1765573,
author = {Gärdin, Christoffer},
title = {{Anomaly Detection with Machine Learning using CLIP in a Video Surveillance Context}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5564--SE}},
year = {2023},
address = {Sweden},
}
In the area of Traffic Sign Recognition (TSR), deep learning models are trained to detect and classify images of traffic signs. The amount of data available to train these models is often limited, and collecting more data is time-consuming and expensive. A possible complement to traditional data acquisition is to generate synthetic images with a generative machine learning model. This thesis investigates the use of denoising diffusion probabilistic models for generating synthetic data of one or multiple traffic sign classes, when providing different amounts of real images for the class (or classes). In the few-sample method, the number of images used ranged from 1 to 1000, and zero images were used in the zero-shot method. The results from the few-sample method show that combining synthetic images with real images when training a traffic sign classifier increases the performance in 3 out of 6 investigated cases. The results indicate that the developed zero-shot method is useful if further refined, and could potentially enable generation of realistic images of signs not seen in the training data.
@mastersthesis{diva2:1764694,
author = {Carlson, Johanna and Byman, Lovisa},
title = {{Generation of Synthetic Traffic Sign Images using Diffusion Models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5563--SE}},
year = {2023},
address = {Sweden},
}
Today the process of sorting second-hand clothes and textiles is mostly manual. In this master’s thesis, methods for automating this process as well as improving the manual sorting process have been investigated. The methods explored include the automatic prediction of price and intended usage for second-hand clothes, as well as different types of image retrieval to aid manual sorting. Two models were examined: CLIP, a multi-modal model, and MAE, a self-supervised model. Quantitatively, the results favored CLIP, which outperformed MAE in both image retrieval and prediction. However, MAE may still be useful for some applications in terms of image retrieval as it returns items that look similar, even if they do not necessarily have the same attributes. In contrast, CLIP is better at accurately retrieving garments with as many matching attributes as possible. For price prediction, the best model was CLIP. When fine-tuned on the dataset used, CLIP achieved an F1-Score of 38.08 using three different price categories in the dataset. For predicting the intended usage (either reusing the garment or exporting it to another country) the best model managed to achieve an F1-Score of 59.04.
@mastersthesis{diva2:1763534,
author = {Hermansson, Simon},
title = {{Learning Embeddings for Fashion Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5567--SE}},
year = {2023},
address = {Sweden},
}
Point cloud registration with data measured by a photon-counting LIDAR sensor from a large distance (500 m - 1.5 km) is an expanding field. Data measured from far away is sparse and has low detail, which can make the registration process difficult, and registering this type of data is fairly unexplored. In recent years, machine learning for point cloud registration has been explored with promising results. This work compares the performance of the point cloud registration algorithm Iterative Closest Point (ICP) with state-of-the-art algorithms, using data from a photon-counting LIDAR sensor. The data was provided by the Swedish Defense Research Agency (FOI). The chosen state-of-the-art algorithms were the non-learning-based Fast Global Registration and the learning-based D3Feat and SpinNet. The results indicated that all state-of-the-art algorithms achieve a substantial increase in performance compared to the Iterative Closest Point method. All the state-of-the-art algorithms utilize their calculated features to obtain better correspondence points and can therefore achieve higher performance in point cloud registration. D3Feat performed point cloud registration with the highest accuracy of all the evaluated algorithms, including ICP.
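For orientation, the ICP baseline can be run in a few lines with Open3D, as sketched below; the voxel size and correspondence distance are illustrative assumptions and would need tuning for sparse long-range photon-counting data, while the learned methods (D3Feat, SpinNet) instead replace the correspondence step with learned descriptors.

import numpy as np
import open3d as o3d

def icp_register(source_path, target_path, voxel_size=0.5, max_corr_dist=2.0):
    """Point-to-point ICP baseline: downsample both clouds, then refine from an
    identity initialisation. Returns the estimated 4x4 rigid transform."""
    source = o3d.io.read_point_cloud(source_path).voxel_down_sample(voxel_size)
    target = o3d.io.read_point_cloud(target_path).voxel_down_sample(voxel_size)
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation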
@mastersthesis{diva2:1761482,
author = {Boström, Maja},
title = {{Point Cloud Registration using both Machine Learning and Non-learning Methods:
with Data from a Photon-counting LIDAR Sensor}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5558--SE}},
year = {2023},
address = {Sweden},
}
Image fusion is a technique that aims to combine semantic information from different source images into a new synthesized image that contains information from both source images. It is a technique that can be useful in many different areas, such as reconnaissance, surveillance and medical diagnostics. A crucial aspect of image fusion is finding important features in the source images and preserving these in the fused image. A possible method to find and preserve the features could be to utilize deep learning. This thesis trains and evaluates an unsupervised network on two new datasets created for the fusion of visual near-infrared (VNIR) and long-wave infrared (LWIR) images. Feature representations obtained from a pre-trained network are implemented in the loss function, followed by training and evaluation of that model as well. Both deep learning models are compared with results obtained from a traditional image fusion method. The trained models performed well, although the traditional method performed better on dataset 1. The deep learning models did perform better on dataset 2, which contained images captured in daylight and dusk conditions. The resulting fused images from the deep learning approaches demonstrated better contrast compared to the fused images obtained by averaging. The additional feature representations obtained from the pre-trained network did not improve the results on either of the datasets. An explanation for these results could be that the loss function already helps to preserve the semantic information in the features.
@mastersthesis{diva2:1737202,
author = {Granqvist, Matilda},
title = {{Infrared and Visible Image Fusion with an Unsupervised Network}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5540--SE}},
year = {2023},
address = {Sweden},
}
With over 53 million articles and 11 million images, Wikipedia is the largest encyclopedia in history. The number of users is equally significant, with daily views surpassing 1 billion. Such an enormous system needs automation of tasks to make it possible for the volunteers to maintain. When it comes to textual data, there is a machine-learning-based system called ORES that automates tasks such as article quality estimation and article topic routing. A visual counterpart also needs to be developed to support tasks such as vandalism detection in images and to better understand the visual data of Wikipedia. Researchers from the Wikimedia Foundation identified a hindrance to implementing the visual counterpart of ORES: the images of Wikipedia lack topical metadata. Thus, this work aims to develop a deep learning model that classifies images into a set of topics, which have been pre-determined in parallel work. State-of-the-art image classification models and other methods to mitigate the existing class imbalance are used. The conducted experiments show, among other things, that: using the data that considers the hierarchy of labels performs better; resampling techniques are ineffective at mitigating imbalance due to the high label concurrence; sample weighting improves metrics; and initializing parameters as pre-trained on ImageNet rather than randomly yields better metrics. Moreover, we find interesting outlier labels that, despite having fewer samples, obtain better performance metrics, which is believed to be due either to bias from pre-training or simply more signal in the label. The distribution of the visual data predicted by the models is also presented. Finally, some qualitative examples of the model predictions for selected images are presented, demonstrating the ability of the model to find correct labels that are missing in the ground truth.
@mastersthesis{diva2:1729493,
author = {Vieira Bernat, Matheus},
title = {{Topical Classification of Images in Wikipedia:
Development of topical classification models followed by a study of the visual content of Wikipedia}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5538--SE}},
year = {2023},
address = {Sweden},
}
Harness racing horses are exposed to high workload and consequently, they are at risk of joint injuries and lameness. In recent years, the interest in applications to improve animal welfare has increased and there is a demand for objective assessment methods that can enable early and robust diagnosis of injuries.
In this thesis, experiments were conducted on video recordings collected by a helmet camera mounted on the driver of a sulky. The aim was to take the first steps toward equine gait analysis by investigating how semantic segmentation and 3D reconstruction of such data could be performed. Since these were the first experiments made on this data, no expectations of the results existed in advance.
Manual pixel-wise annotations were created on a small set of extracted frames and a deep learning model for semantic segmentation was trained to localize the horse, as well as the sulky and reins. The results are promising and could probably be further improved by expanding the annotated dataset and using a larger image resolution. Structure-from-motion using COLMAP was performed to estimate the camera motion in part of a video recording. A method to filter out dynamic objects based on masks created from predicted segmentation maps was investigated and the results showed that the reconstruction was part-wise successful, but struggled when dynamic objects were not filtered out and when the equipage was moving at high speed along a straight stretch.
Overall the results are promising, but further development needs to be conducted to ensure robustness and conclude whether data collected by the investigated helmet camera configuration is suitable for equine gait analysis.
@mastersthesis{diva2:1729598,
author = {Hult, Evelina},
title = {{Toward Equine Gait Analysis:
Semantic Segmentation and 3D Reconstruction}},
school = {Linköping University},
type = {{LiTH-ISY-EX--23/5539--SE}},
year = {2023},
address = {Sweden},
}
This master thesis project was done together with Saab Dynamics in Linköping during the spring of 2022 and aims to perform an online IMU-camera calibration using an AprilTag board. Experiments are conducted on two different types of datasets, the public dataset Euroc and internal datasets from Saab. The calibration is done iteratively by solving a series of nonlinear optimization problems without any initial knowledge of the sensor configuration. The method is largely based on work by Huang and collaborators. Besides finding the transformation between the IMU and the camera, the biases in the IMU and the time delay between the two sensors are also explored. By comparing the resulting transformation with Kalibr, the current state-of-the-art offline calibration toolbox, it is possible to conclude that the model can find and correct for the biases in the gyroscope. It is therefore important to include these biases in the model. The model is able to roughly find the time shift between the two sensors but has more difficulties correcting for it. The thesis also aims to explore ways of compiling a good dataset for calibration. Results show that it is desirable to avoid rapid movements as well as images gathered at distances from the AprilTag board that vary a lot. Also, a shorter exposure time is useful to avoid losing AprilTag detections.
@mastersthesis{diva2:1701458,
author = {Karlhede, Arvid},
title = {{Online Camera-IMU Calibration}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5524--SE}},
year = {2022},
address = {Sweden},
}
With advancements in space technology, remote sensing applications, and computer vision, significant improvements in the data describing our planet are seen today. Researchers want to gather different kinds of data and perform data fusion techniques between them to increase our understanding of the world. Two such data types are Electro-Optical images and Synthetic Aperture Radar images. For data fusion, the images need to be accurately aligned. Researchers have investigated methods for robustly and accurately registering these images for many years. However, recent advancements in imaging systems have made the problem more complex than ever.
Currently, the imaging satellites that capture information around the globe have achieved a resolution of less than a meter per pixel. There is an increase in signal complexity for high-resolution SAR images due to how the imaging system operates. Interference between waves gives rise to speckled noise and geometric distortions, making the images very difficult to interpret. This directly affects the image registration accuracy.
In this thesis, the complexity of the problem regarding registration between SAR and EO data was described, and methods for registering the images were investigated. The methods were feature- and area-based. The feature-based method used a KAZE filter and SURF descriptor. The method found many key points but few correct correspondences. The area-based methods used FFT and MI, respectively. FFT was deemed best for higher quality images, whereas MI better dealt with the non-linear intensity difference. More complex techniques, such as dense neural networks, were excluded. No method achieved satisfying results on the entire data set, but the area-based methods accomplished complementary results.
A conclusion was drawn that the distortions in the SAR images are too significant to register accurately using only CV algorithms. Since the area-based methods achieved good results on images excluding significant distortions, future work should focus on solving the geometrical errors and increasing the registration accuracy.
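As background for the area-based FFT method mentioned above, a minimal translation-only phase-correlation sketch is given below (assuming two equally sized single-channel images); the MI-based alternative would instead optimise mutual information between intensities, and neither snippet reflects the exact thesis implementation.

import numpy as np

def phase_correlation_shift(img_a, img_b):
    """Estimate the integer translation between two equally sized single-channel
    images using FFT-based phase correlation (an area-based registration method)."""
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross_power = fa * np.conj(fb)
    cross_power /= np.abs(cross_power) + 1e-12            # keep only phase information
    corr = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)  # correlation peak location
    dims = np.array(corr.shape, dtype=float)
    shifts = np.array(peak, dtype=float)
    shifts[shifts > dims / 2] -= dims[shifts > dims / 2]  # wrap to signed displacements
    return shifts                                         # (row shift, column shift)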
@mastersthesis{diva2:1682316,
author = {Hansson, Niclas},
title = {{Investigation of Registration Methods for High Resolution SAR-EO Imagery}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5506--SE}},
year = {2022},
address = {Sweden},
}
In recent years, pictures from handheld devices such as smartphones have been increasingly utilized as a documentation tool by medical practitioners not trained to take professional photographs. Similarly to the other types of image modalities, the images should be taken in a way to capture the vital information in the region of interest. Nevertheless, image capturing cannot always be done as desired, so images may exhibit different blur types at the region of interest. Having blurry images does not serve medical purposes, therefore, the patients might have to schedule a second appointment several days later to retake the images. A solution to this problem is to create an algorithm which immediately after capturing an image determines if it is medically useful and notifies the user of the result. The algorithm needs to perform the analysis at a reasonable speed, and at best, with a limited number of operations to make the calculations directly in the smartphone device. A large number of medical images must be available to create such an algorithm. Medical images are difficult to acquire, and it is specifically difficult to acquire blurry images since they are usually deleted.
The main objective of this thesis is to determine the medical usefulness of images taken with smartphone cameras, using both machine learning and handcrafted algorithms, with a low number of floating point operations and a high performance. Seven different algorithms (one hand-crafted and six machine learned) are created and compared regarding both number of floating point operations and performance. Fast Walsh-Hadamard transforms are the basis of the hand-crafted algorithm. The employed machine learning algorithms are both based on common convolutional neural networks (MobileNetV3 and ResNet50) and on our own designs. The issue with the low number of medical images acquired is solved by training the machine learning models on a synthetic dataset, where the non-medically useful images are generated by applying blur on the medically useful images. These models do, however, undergo evaluation using a real dataset, containing medically useful images as well as non-medically useful images.
Our results indicate that a real-time determination of the medical usefulness of images is possible on handheld devices, since our machine-learned model DeepLAD-Net reaches the highest accuracy with 42 · 10^6 floating point operations. In terms of accuracy, MobileNetV3-large is the second best model, with 31 times as many floating point operations as our best model.
@mastersthesis{diva2:1670428,
author = {Zahra, Hasseli and Raamen, Anwia Odisho},
title = {{Automatic Quality Assessment of Dermatology Images:
A Comparison Between Machine Learning and Hand-Crafted Algorithms}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5486--SE}},
year = {2022},
address = {Sweden},
}
The development of autonomous driving systems has been one of the most popular research areas in the 21st century. One key component of these kinds of systems is the ability to perceive and comprehend the physical world. Two techniques that address this are object detection and semantic segmentation. During the last decade, CNN based models have dominated these types of tasks. However, in 2021, transformer based networks were able to outperform the existing CNN approach, therefore, indicating a paradigm shift in the domain. This thesis aims to explore the use of a vision transformer, particularly a Swin Transformer, in an object detection and semantic segmentation framework, and compare it to a classical CNN on road scenes. In addition, since real-time execution is crucial for autonomous driving systems, the possibility of a parameter reduction of the transformer based network is investigated. The results appear to be advantageous for the Swin Transformer compared to the convolutional based network, considering both object detection and semantic segmentation. Furthermore, the analysis indicates that it is possible to reduce the computational complexity while retaining the performance.
@mastersthesis{diva2:1678704,
author = {Hardebro, Mikaela and Jirskog, Elin},
title = {{Transformer Based Object Detection and Semantic Segmentation for Autonomous Driving}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5487--SE}},
year = {2022},
address = {Sweden},
}
In recent years, the EU has observed a decrease in the stocks of certain fish species due to unrestricted fishing. To combat the problem, many fisheries are investigating how to automatically estimate the catch size and composition using sensors onboard the vessels. Yet, measuring the size of fish in marine imagery is a difficult task. The images generally suffer from complex conditions caused by cluttered fish, motion blur and dirty sensors.
In this thesis, we propose a novel method for automatic measurement of fish size that can enable measuring both visible and occluded fish. We use a Mask R-CNN to segment the visible regions of the fish, and then fill in the shape of the occluded fish using a U-Net. We train the U-Net to perform shape completion in a semi-supervised manner, by simulating occlusions on an open-source fish dataset. In contrast to previous shape completion work, we teach the U-Net when to fill in the shape and when not to, by including a small portion of fully visible fish in the input training data.
Our results show that our proposed method succeeds in filling in the shape of the synthetically occluded fish as well as of some of the cluttered fish in real marine imagery. We achieve an mIoU score of 93.9 % on 1 000 synthetic test images and present qualitative results on real images captured onboard a fishing vessel. The qualitative results show that the U-Net can fill in the shapes of lightly occluded fish, but struggles when the tail fin is hidden and only parts of the fish body are visible. This task is difficult even for a human, and the performance could perhaps be increased by including the fish appearance in the shape completion task. The simulation-to-reality gap could perhaps also be reduced by fine-tuning the U-Net on some real occlusions, which could increase the performance on the heavy occlusions in the real marine imagery.
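The occlusion simulation used for the semi-supervised training could, in its simplest form, look like the sketch below, which cuts a random rectangle out of a full fish mask; the rectangular cut-out is a simplifying assumption, and the thesis additionally mixes in fully visible fish so the network also learns when not to fill in.

import numpy as np

def simulate_occlusion(mask, max_cut=0.5, rng=None):
    """Given a full binary fish mask (H, W), remove a random rectangular region to
    mimic occlusion by neighbouring fish. Returns (occluded input, full target)."""
    rng = rng or np.random.default_rng()
    h, w = mask.shape
    cut_h = int(h * rng.uniform(0.2, max_cut))
    cut_w = int(w * rng.uniform(0.2, max_cut))
    top = rng.integers(0, h - cut_h + 1)
    left = rng.integers(0, w - cut_w + 1)
    occluded = mask.copy()
    occluded[top:top + cut_h, left:left + cut_w] = 0
    return occluded, mask    # U-Net input and its shape-completion target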
@mastersthesis{diva2:1677704,
author = {Gustafsson, Stina},
title = {{Learning to Measure Invisible Fish}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5517--SE}},
year = {2022},
address = {Sweden},
}
Object tracking can be done in numerous ways, where the goal is to track a target through all frames in a sequence. The ground truth bounding box is used to initialize the object tracking algorithm. Object tracking can be carried out on infrared imagery suitable for military applications to execute tracking even without illumination. Objects, such as aircraft, can deploy countermeasures to impede tracking. The countermeasures most often mainly impact one wavelength band. Therefore, using two different wavelength bands for object tracking can counteract the impact of the countermeasures. The dataset was created from simulations. The countermeasures applied to the dataset are flares and Directional Infrared Countermeasures (DIRCMs).
Different object tracking algorithms exist, and many are based on discriminative correlation filters (DCF). The thesis investigated the DCF-based trackers STRCF and ECO on the created dataset. The STRCF and the ECO trackers were analyzed using one and two wavelength bands. The following features were investigated for both trackers: grayscale, Histogram of Oriented Gradients (HOG), and pre-trained deep features.
The results indicated that the STRCF and the ECO trackers using two wavelength bands instead of one improved performance on sequences with countermeasures. The use of HOG, deep features, or a combination of both improved the performance of the STRCF tracker using two wavelength bands. Likewise, the performance of the ECO tracker using two wavelength bands was improved by the use of deep features. However, the negative aspect of using two wavelength bands and introducing more features is that it resulted in a lower frame rate.
@mastersthesis{diva2:1676100,
author = {Modorato, Sara},
title = {{Tracking Under Countermeasures Using Infrared Imagery}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5473--SE}},
year = {2022},
address = {Sweden},
}
Unmanned aerial vehicles (UAVs) with high-resolution cameras are common in today’s society. Industries, such as the forestry industry, use drones to get a fast overview of tree populations. More advanced sensors, such as near-infrared light or depth data, can increase the amount of information that UAV images provide about the forest, such as tree quantity or forest health. However, the fast-expanding field of deep learning could help expand the information acquired using only RGB cameras. Three deep learning models, Faster R-CNN, RetinaNet, and YOLOR, were compared to investigate this. It was also investigated whether initializing the models using transfer learning from the MS COCO dataset could increase the performance of the models. The dataset used was Swedish Forest Agency (2021): Forest Damages-Spruce Bark Beetle 1.0 National Forest Data Lab, together with drone images provided by IT-Bolaget Per & Per. The deep learning models were to detect five different tree species: spruce, pine, birch, aspen, and others. The results show potential for the usage of deep learning to detect tree species in images from UAVs.
@mastersthesis{diva2:1676909,
author = {Sievers, Olle},
title = {{CNN-Based Methods for Tree Species Detection in UAV Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5502--SE}},
year = {2022},
address = {Sweden},
}
Estimation of forest parameters using remote sensing information could streamline the forest industry from a time and economic perspective. This thesis utilizes object detection and semantic segmentation to detect and classify individual trees from images over 3D models reconstructed from satellite images. This thesis investigated two methods that showed different strengths in detecting and classifying trees in deciduous, evergreen, or mixed forests. These methods are not just valuable for forest inventory but can be greatly useful for telecommunication companies and in defense and intelligence applications. This thesis also presents methods for estimating tree volume and estimating tree growth in 3D models. The results from the methods show the potential to be used in forest management. Finally, this thesis shows several benefits of managing a digitalized forest, economically, environmentally, and socially.
@mastersthesis{diva2:1673885,
author = {Dahm\'{e}n, Gustav and Strand, Erica},
title = {{Forest Growth And Volume Estimation Using Machine Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5508--SE}},
year = {2022},
address = {Sweden},
}
Ceramic materials contain several defects, one of which is porosity. At the time of writing, porosity measurement is a manual and time-consuming process performed by a human operator. With advances in deep learning for computer vision, this thesis explores to what degree convolutional neural networks and semantic segmentation can reliably measure porosity from microscope images. Combining classical image processing techniques with deep learning, images were automatically labeled and then used for training semantic segmentation neural networks leveraging transfer learning. Deep learning-based methods were more robust and could more reliably identify porosity in a larger variety of images than solely relying on classical image processing techniques.
@mastersthesis{diva2:1674176,
author = {Isaksson, Filip},
title = {{Measuring Porosity in Ceramic Coating using Convolutional Neural Networks and Semantic Segmentation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5490--SE}},
year = {2022},
address = {Sweden},
}
With the increasing demand for labeled data in machine learning for visual perception tasks, the interest in using synthetically generated data has grown. Due to the existence of a domain gap between synthetic and real data, strategies in domain adaptation are necessary to achieve high performance with models trained on synthetic or mixed data.
With a dataset of synthetically blocked fish-eye lenses in traffic environments, we explore different strategies to train a neural network. The neural network is a binary classifier for full blockage detection. The different strategies tested are data mixing, fine-tuning, domain adversarial training, and adversarial discriminative domain adaptation. Different ratios between synthetically generated data and real data are also tested. Our experiments showed that fine-tuning had slightly superior results in this test environment. To fully take advantage of the domain adversarial training, training until domain indiscriminate features are learned is necessary and helps the model attain higher performance than using random data mixing.
@mastersthesis{diva2:1671549,
author = {Tran, Hoang},
title = {{Learning with Synthetically Blocked Images for Sensor Blockage Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5509--SE}},
year = {2022},
address = {Sweden},
}
An autonomous vehicle is a complex system that requires a good perception of the surrounding environment to operate safely. One part of that is multiple object tracking, which is an essential component in camera-based perception whose responsibility is to estimate object motion from a sequence of images. This requires an association problem to be solved where newly estimated object positions are mapped to previously predicted trajectories, for which different solution strategies exist.
In this work, a multiple hypothesis tracking algorithm is implemented. The purpose is to demonstrate that measurement associations are improved compared to less compute-intensive alternatives. It was shown that the implemented algorithm performed 13 percent better than an intersection over union tracker when evaluated using a standard evaluation metric.
Furthermore, this work also investigates the usage of abstraction layers to accelerate time-critical parallel operations on the GPU. It was found that the execution time of the tracking algorithm could be reduced by 42 percent by replacing four functions with implementations written in the purely functional array language Futhark. Finally, it was shown that a GPU code abstraction layer can reduce the knowledge barrier required to write efficient CUDA kernels.
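For context, the simpler intersection-over-union tracker used as a comparison point solves a one-to-one association between existing tracks and new detections, as in the minimal sketch below; the multiple hypothesis tracker instead keeps several association hypotheses alive across frames. The box format and threshold are illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, min_iou=0.3):
    """Single-hypothesis baseline: one-to-one matching that maximises total IoU."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]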
@mastersthesis{diva2:1670800,
author = {Nolkrantz, Marcus},
title = {{Efficient multiple hypothesis tracking using a purely functional array language}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5482--SE}},
year = {2022},
address = {Sweden},
}
Lens distortions appear in almost all digital images and cause straight lines to appear curved in the image. This can contribute to errors in position estimations and 3D reconstruction and it is therefore of interest to correct for the distortion. If the camera is available, the distortion parameters can be obtained when calibrating the camera. However, when the camera is unavailable the distortion parameters can not be found with the standard camera calibration technique and other approaches must be used. Recently, variants of Perspective-n-Point (PnP) extended with lens distortion and focal length parameters have been proposed. Given a set of 2D-3D point correspondences, the PnP-based methods can estimate distortion parameters without the camera being available or with modified settings. In this thesis, the performance of PnP-based methods is compared to Zhang’s camera calibration method. The methods are compared both quantitatively, using the errors in reprojection and distortion parameters, and qualitatively by comparing images before and after lens distortion correction. A test set for the comparison was obtained from a camera and a 3D laser scanner of an indoor scene. The results indicate that one of the PnP-based models can achieve a similar reprojection error as the baseline method for one of the cameras. It could also be seen that two PnP-based models could reduce lens distortion when visually comparing the test images to the baseline. Moreover, it was noted that a model can have a small reprojection error even though the distortion coefficient error is large and the lens distortion is not completely removed. This indicates that it is important to include both quantitative measures, such as reprojection error and distortion coefficient errors, as well as qualitative results when comparing lens distortion correction methods. It could also be seen that PnP-based models with more parameters in the estimation are more sensitive to noise.
@mastersthesis{diva2:1670770,
author = {Olsson, Emily},
title = {{Lens Distortion Correction Without Camera Access}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5476--SE}},
year = {2022},
address = {Sweden},
}
Automatic detection of weeds could be used for more efficient weed control in agriculture. In this master thesis, weed detectors have been trained and examined on data collected by RISE to investigate whether an accurate weed detector could be trained on the collected data. When only using annotations of the weed class Creeping thistle for training and evaluation, a detector achieved a mAP of 0.33. When using four classes of weed, a detector was trained to a mAP of 0.07. The performance was worse than in a previous study also dealing with weed detection. Hypotheses for why the performance was lacking were examined. Experiments indicated that the problem could not fully be explained by the model being underfitted, nor by the objects’ backgrounds being too similar to the foreground, nor by the quality of the annotations being too low. The performance was better when training the model with as much data as possible than when only selected segments of the data were used.
@mastersthesis{diva2:1666845,
author = {Ahlqvist, Axel},
title = {{Examining Difficulties in Weed Detection}},
school = {Linköping University},
type = {{}},
year = {2022},
address = {Sweden},
}
Radiologists often have to look through many different patients and examinations in quick succession, and to aid the workflow the different types of images should be presented to the radiologist in the same manner and order for each new examination, thus decreasing the time needed for the radiologist to either find the correct image or rearrange the images to their liking. A step in this process requires a comparison between two images to be made, producing a score between 0 and 1 describing how similar the images are. A similar algorithm already exists at Sectra, but that algorithm only uses the metadata from the images without considering the actual pixel data.
The aim of this thesis was to explore different methods of doing the same comparison as the previous algorithm but using only the pixel data. Considering only 3D volumes from CT examinations of the abdomen and thorax region, this thesis explores the possibility of using SSIM, SIFT, and SIFT together with a histogram comparison using the Bhattacharyya distance for this task. It was deemed very important that the ranking produced when ordering the images in terms of similarity to one reference image followed a specific order. This order was determined by consulting personnel at Sectra who work closely with the clinical side of radiology.
SSIM was able to differentiate between different plane orientations, since they usually had large resolution differences in each direction, but it could not be made to follow the desired ranking and was thus disregarded as a reliable option for this problem. The method using SIFT followed the desired ranking better, but struggled a lot with differentiating between the different contrast phases. A histogram component was also added to this method, which increased the accuracy and improved the ranking. However, further development is still needed for this method to be a reliable option that could be used in a clinical setting.
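A combined SIFT-plus-histogram similarity of the kind described could be sketched with OpenCV as below, for two grayscale uint8 images; the match-ratio construction and the weighting between the two components are illustrative assumptions rather than the thesis's exact scoring.

import cv2

def similarity_score(img_a, img_b, w_hist=0.5):
    """Combine a SIFT match ratio with a Bhattacharyya histogram comparison into a
    single 0-1 similarity score for two grayscale images."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        sift_score = 0.0
    else:
        matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
        good = [m[0] for m in matches
                if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]  # Lowe's ratio test
        sift_score = len(good) / max(1, min(len(kp_a), len(kp_b)))
    hist_a = cv2.calcHist([img_a], [0], None, [64], [0, 256])
    hist_b = cv2.calcHist([img_b], [0], None, [64], [0, 256])
    cv2.normalize(hist_a, hist_a)
    cv2.normalize(hist_b, hist_b)
    hist_score = 1.0 - cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_BHATTACHARYYA)
    return w_hist * hist_score + (1.0 - w_hist) * min(1.0, sift_score)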
@mastersthesis{diva2:1665838,
author = {Castenbrandt, Felicia},
title = {{Image Similarity Scoring for Medical Images in 3D}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5484--SE}},
year = {2022},
address = {Sweden},
}
Deep learning has shown to be successful on the task of semantic segmentation of three-dimensional (3D) point clouds, which has many interesting use cases in areas such as autonomous driving and defense applications. A common type of sensor used for collecting 3D point cloud data is Light Detection and Ranging (LiDAR) sensors. In this thesis, a time-correlated single-photon counting (TCSPC) LiDAR is used, which produces very accurate measurements over long distances up to several kilometers. The dataset collected by the TCSPC LiDAR used in the thesis contains two classes, person and other, and it comes with several challenges due to it being limited in terms of size and variation, as well as being extremely class imbalanced. The thesis aims to identify, analyze, and evaluate state-of-the-art deep learning models for semantic segmentation of point clouds produced by the TCSPC sensor. This is achieved by investigating different loss functions, data variations, and data augmentation techniques for a selected state-of-the-art deep learning architecture. The results showed that loss functions tailored for extremely imbalanced datasets performed the best with regard to the metric mean intersection over union (mIoU). Furthermore, an improvement in mIoU could be observed when some combinations of data augmentation techniques were employed. In general, the performance of the models varied heavily, with some achieving promising results and others achieving much worse results.
@mastersthesis{diva2:1667072,
author = {Süsskind, Caspian},
title = {{Deep Learning Semantic Segmentation of 3D Point Cloud Data from a Photon Counting LiDAR}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5467--SE}},
year = {2022},
address = {Sweden},
}
When a camera system in a car is mounted behind the windshield, light rays will be refracted by the windshield. The distortion can be significant, especially for wide field-of-view cameras. Traditional approaches handle the windshield distortion along with the calibration that calculates the intrinsic and extrinsic parameters. However, these approaches do not handle the windshield distortion explicitly, and understanding the image formation requires understanding more about the windshield distortion effect. In this thesis, data is collected from a camera system viewed with and without the windshield. The windshield distortion effect has been studied by varying the windshield’s tilt and the camera’s setup. Points are then found in both images and matched. From this, a distortion difference is calculated and analyzed. Next, a preliminary model of the windshield distortion effect is presented and evaluated. The results show that the model works well for all cases and the two windshields considered in this thesis.
@mastersthesis{diva2:1638117,
author = {Luong, Therese},
title = {{Windshield Distortion Modelling}},
school = {Linköping University},
type = {{LiTH-ISY-EX--22/5455--SE}},
year = {2022},
address = {Sweden},
}
Being able to train machine learning models on simulated data can be of great interest in several applications, one of them being for autonomous driving of cars. The reason is that it is easier to collect large labeled datasets as well as performing reinforcement learning in simulations. However, transferring these learned models to the real-world environment can be hard due to differences between the simulation and the reality; for example, differences in material, textures, lighting and content. One approach is to use domain adaptation, by making the simulations as similar as possible to the reality. The thesis's main focus is to investigate domain adaptation as a way to meet the reality-gap, and also compare it to an alternative method, domain randomization.
Two different methods of domain adaptation; one adapting the simulated data to reality, and the other adapting the test data to simulation, are compared to using domain randomization. These are evaluated with a classifier making decisions for a robot car while driving in reality. The evaluation consists of a quantitative evaluation on real-world data and a qualitative evaluation aiming to observe how well the robot is driving and avoiding obstacles. The results show that the reality-gap is very large and that the examined methods reduce it, with the two using domain adaptation resulting in the largest decrease. However, none of them led to satisfactory driving.
@mastersthesis{diva2:1624770,
author = {Forsberg, Fanny},
title = {{Domain Adaptation to Meet the Reality-Gap from Simulation to Reality}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5453--SE}},
year = {2022},
address = {Sweden},
}
This thesis investigates methods for automatic colour transfer when working with geodata and possible metrics to evaluate the results. Several methods for colour transfer, as well as methods to create an objective measurement, were tested. The method was evaluated using a subjective score generated by surveying eight people working with geodata. In the survey, the participants were asked to “Rank the images from most similar to least similar, with what you imagine the result would have been if you would have made the colour transfer manually”. The method with the best overall performance in this study was colour transfer in the CIElαβ colour space. This method was only matched by a method that first segments the image based on colour information, as that method had the highest average subjective score but a larger standard deviation than the other methods. This was suspected to be largely due to the deviation in quality of the segmentation algorithm. Using a different method for segmenting the image, this method might perform even better. The objective measurements proposed in this study were not found to have a consistent correlation with the subjective measurement, with the exception of gradient structural similarity. Other methods could have a use in some cases but not as a general colour transfer objective measurement, though a larger study and more data would be needed to confirm the findings.
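The statistics-transfer family of colour transfer methods referred to above can be sketched as below, using OpenCV's Lab conversion as a stand-in for whichever decorrelated colour space the thesis used; the snippet is a generic Reinhard-style illustration, not the thesis implementation.

import cv2
import numpy as np

def colour_transfer_lab(source, reference):
    """Statistics transfer: shift and scale each Lab channel of the source image to
    match the mean and standard deviation of the reference image."""
    src = cv2.cvtColor(source, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(reference, cv2.COLOR_BGR2LAB).astype(np.float32)
    src_mean, src_std = src.reshape(-1, 3).mean(0), src.reshape(-1, 3).std(0) + 1e-6
    ref_mean, ref_std = ref.reshape(-1, 3).mean(0), ref.reshape(-1, 3).std(0)
    result = (src - src_mean) / src_std * ref_std + ref_mean
    return cv2.cvtColor(np.clip(result, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)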
@mastersthesis{diva2:1601738,
author = {Ågren, Anton},
title = {{Automatic Colour Transfer for Geodata}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5378--SE}},
year = {2021},
address = {Sweden},
}
In this thesis, three well-known self-supervised methods have been implemented and trained on road scene images. The three so-called pretext tasks RotNet, MoCov2, and DeepCluster were used to train a neural network in a self-supervised manner. The self-supervised pre-trained networks were then evaluated with different amounts of labeled data on two downstream tasks, object detection and semantic segmentation. The performance of the self-supervised methods is compared to networks trained from scratch on the respective downstream task. The results show that it is possible to achieve a performance increase using self-supervision on a dataset containing road scene images only. When only a small amount of labeled data is available, the performance increase can be substantial, e.g., an mIoU increase from 33 to 39 when training semantic segmentation on 1750 images with a RotNet pre-trained backbone compared to training from scratch. However, it seems that when a large amount of labeled images is available (>70000 images), the self-supervised pretraining does not increase the performance as much, or at all.
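Of the three pretext tasks, RotNet has the simplest data preparation: every image is rotated by a multiple of 90 degrees and the network predicts which rotation was applied. A minimal PyTorch sketch of that step is shown below (MoCo v2 and DeepCluster use different pretext objectives and are not covered here).

import torch

def rotnet_batch(images):
    """RotNet pretext task: rotate each image by 0, 90, 180 and 270 degrees and let
    the network predict the rotation class (0-3). images: (N, C, H, W) tensor."""
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    inputs = torch.cat(rotated, dim=0)
    labels = torch.arange(4).repeat_interleave(images.shape[0])
    return inputs, labels    # feed to any classifier backbone with a cross-entropy loss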
@mastersthesis{diva2:1608285,
author = {Gustavsson, Simon},
title = {{Object Detection and Semantic Segmentation Using Self-Supervised Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5357--SE}},
year = {2021},
address = {Sweden},
}
Point set registration is a well-researched yet still not very exploited area in computer vision. As the field of machine learning grows, the possibilities of application expand. This thesis investigates the possibility of expanding an already implemented probabilistic machine learning approach to point set registration to more complex, larger datasets gathered in a forest environment. The system used as a starting point was created by Järemo Lawin et al. [10]. The aim of the thesis was to investigate the possibility of registering the forest data with the existing system, without ground-truth poses, with different optimizers, and to implement a SLAM pipeline. Also, older methods were used as a benchmark for evaluation, more specifically iterative closest point (ICP) and fast global registration (FGR). To enable the gathered data to be processed by the registration algorithms, preprocessing was required, transforming the data points from the coordinate system of the sensor to world-relative coordinates via LiDAR base coordinates. Subsequently, the registration was performed with different approaches. Both the KITTI odometry dataset, with which RLLReg was originally evaluated [10], and the gathered forest data were used. Data augmentation was utilized to enable ground-truth-independent training and to increase diversity in the data. In addition, the registration results were used to create a SLAM pipeline, enabling mapping and localization in the scanned areas. The results showed great potential for using RLLReg to register forest scenes compared to other, older approaches. In particular, the lack of ground truth was manageable using data augmentation to create training data. Moreover, there was no evidence that AdaBound improves the system when replacing the Adam optimizer. Finally, forest models with plotted sensor paths were generated with decent results, although there is potential for further refinement through post-processing. Nevertheless, the possibility of point set registration and LiDAR-SLAM using machine learning has been confirmed.
@mastersthesis{diva2:1612438,
author = {Hjert, Anton},
title = {{Machine Learning for LiDAR-SLAM:
In Forest Terrains}},
school = {Linköping University},
type = {{}},
year = {2021},
address = {Sweden},
}
Hyperspectral imaging based on the use of an exponentially variable filter gives the possibility to construct a lightweight hyperspectral sensor. The exponentially variable filter captures the whole spectral range in each image where each column captures a different wavelength. To gather the full spectrum for any given point in the image requires the fusion of several gathered images with movement in between captures. The construction of a hyperspectral cube requires registration of the gathered images. With a lightweight sensor comes the possibility to mount the hyperspectral sensor on an unmanned aerial vehicle to collect aerial footage. This thesis presents a registration algorithm capable of constructing a complete hyperspectral cube of almost any chosen area in the captured region. The thesis presents the result of a construction method using a multi-frame super-resolution algorithm trying to increase the spectral resolution and a spline interpolation method interpolating missing spectral data. The result of an algorithm trying to suggest the optimal spectral and spatial resolution before constructing the hyperspectral cube is also presented. Lastly, the result of an algorithm providing information about the quality of the constructed hyperspectral cube is also presented.
@mastersthesis{diva2:1596253,
author = {Freij, Hannes},
title = {{Hyperspectral Image Registration and Construction From Irregularly Sampled Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5408--SE}},
year = {2021},
address = {Sweden},
}
In Storytel’s application, in which a user can read and listen to digitalized literature, a user is displayed a list of books where the first thing the user encounters is the book title and cover. A book cover is therefore essential to attract a consumer’s attention. In this study, we take a data-driven approach to investigating the design principles for book covers through deep learning models and explainable AI. The first aim is to explore how well a Convolutional Neural Network (CNN) can interpret and classify a book cover image according to its genre in a multi-class classification task. The second aim is to increase model interpretability and investigate model feature-to-genre correlations. With the help of the explanatory artificial intelligence method Gradient-weighted Class Activation Mapping (Grad-CAM), we analyze the pixel-wise contribution to the model prediction. In addition, object detection by YOLOv3 was implemented to investigate which objects are detectable and reoccurring in the book covers. An interplay between Grad-CAM and YOLOv3 was used to investigate how identified objects and features correlate to a specific book genre and ultimately answer what makes a good book cover. Using a state-of-the-art CNN model architecture, we achieve an accuracy of 48%, with the best class-wise accuracies for the genres Erotica, Economy & Business and Children with accuracies 73%, 67% and 66%. Quantitative results from the Grad-CAM and YOLOv3 interplay show some strong associations between objects and genres, while indicating weak associations between abstract design principles and genres. Furthermore, a qualitative analysis of Grad-CAM visualizations shows strong relevance of certain objects and text fonts for specific book genres. It was also observed that the portrayal of a feature was relevant for the model prediction of certain genres.
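The Grad-CAM computation referred to above can be summarized in a short sketch: activations of a chosen layer are weighted by the spatially averaged gradients of the class score, followed by a ReLU and normalisation. The model, target layer and tensor shapes below are placeholders, and the snippet is a generic illustration rather than the thesis code.

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM heatmap for a single (C, H, W) image and a target class."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))
    try:
        model.zero_grad()
        score = model(image.unsqueeze(0))[0, class_idx]
        score.backward()
        weights = grads['v'].mean(dim=(2, 3), keepdim=True)            # channel importances
        cam = F.relu((weights * acts['v']).sum(dim=1, keepdim=True))   # weighted activations
        cam = F.interpolate(cam, size=image.shape[1:], mode='bilinear', align_corners=False)
        return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]  # (H, W) in [0, 1]
    finally:
        h1.remove()
        h2.remove()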
@mastersthesis{diva2:1576364,
author = {Velander, Alice and Gumpert Harrysson, David},
title = {{Do Judge a Book by its Cover!
Predicting the genre of book covers using supervised deep learning. Analyzing the model predictions using explanatory artificial intelligence methods and techniques.}},
school = {Linköping University},
type = {{}},
year = {2021},
address = {Sweden},
}
Hyperspectral imaging is an expanding topic within the field of computer vision that uses images of high spectral granularity. Contrastive learning is a discriminative approach to self-supervised learning, a form of unsupervised learning where the network is trained using self-created pseudo-labels. This work combines these two research areas and investigates how a pretrained network based on contrastive learning can be used for hyperspectral images. The hyperspectral images used in this work are generated from simulated RGB images and spectra from a spectral library. The network is trained with a pretext task based on data augmentations, and is evaluated through transfer learning and fine-tuning for a downstream task. The goal is to determine the impact of the pretext task on the downstream task and to determine the required amount of labelled data. The results show that the downstream task (a classifier) based on the pretrained network barely performs better than a classifier without a pretrained network. In the end, more research needs to be done to confirm or reject the benefit of a pretrained network based on contrastive learning for hyperspectral images. Also, the pretrained network should be tested on real-world hyperspectral data and trained with a pretext task designed for hyperspectral images.
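A common instantiation of an augmentation-based contrastive pretext task is the NT-Xent loss, sketched below in PyTorch for two augmented views of the same batch of (hyperspectral) images; the temperature value and batch conventions are illustrative assumptions and the thesis may use a different contrastive formulation.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive (NT-Xent) loss: z1, z2 are (N, D) embeddings of two augmented views
    of the same batch; matching rows are positives, all other rows are negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, D)
    sim = z @ z.t() / temperature                           # cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))              # remove self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)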
@mastersthesis{diva2:1593358,
author = {Syr\'{e}n Grönfelt, Natalie},
title = {{Pretraining a Neural Network for Hyperspectral Images Using Self-Supervised Contrastive Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5382--SE}},
year = {2021},
address = {Sweden},
}
In the glass wool industry, the molten glass flow is monitored for regulation purposes. Given the progress in the computer vision field, the current monitoring solution might be replaced by a camera based solution. The aim of this thesis is to investigate the possibility of using optical flow techniques for estimation of the molten glass flow displacement.
Three glass melt flow datasets were recorded, as well as two additional melt flow datasets, using a NIR camera. The block matching techniques Full Search (FS) and Adaptive Rood Pattern Search (ARPS), as well as the local feature methods ORB and A-KAZE were considered. These four techniques were compared to RAFT, the state-of-the-art approach for optical flow estimation, using available pre-trained models, as well as an approach of using the tracking method ECO for the optical flow estimation.
The methods have been evaluated using the metrics MAE, MSE, and SSIM to compare the warped flow to the target image. In addition, ground truth for 50 frames from each dataset was manually annotated so that the optical flow metric End-Point Error could be used. To investigate the computational complexity, the average computational time per frame was calculated.
The investigation found that RAFT does not perform well on the given data, due to the large displacements of the flows. For simulated displacements of up to about 100 pixels at full resolution, the performance is satisfactory, with results comparable to the traditional methods.
Using ECO for optical flow estimation encounters similar problems as RAFT, where the large displacement proved challenging for the tracker. Simulating smaller motions of up to 60 pixels resulted in good performance, though computation time of the used implementation is much too high for a real-time implementation.
The four traditional block matching and local feature approaches examined in this thesis outperform the state-of-the-art approaches. FS, ARPS, A-KAZE, and ORB all have similar performance on the glass flow datasets, whereas the block matching approaches fail on the alternative melt flow data as the template extraction approach is inadequate. The two local feature approaches, though working reasonably well on all datasets given full resolution, struggle to identify features on down-sampled data. This might be mitigated by fine-tuning the settings of the methods. Generally, ORB mostly outperforms A-KAZE with respect to the evaluation metrics, and is considerably faster.
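The End-Point Error metric used in the evaluation is the per-pixel Euclidean distance between estimated and ground-truth flow vectors, averaged over the (annotated) pixels; a minimal sketch, assuming (H, W, 2) flow arrays and an optional validity mask, is shown below.

import numpy as np

def end_point_error(flow_est, flow_gt, valid=None):
    """Average End-Point Error between estimated and ground-truth flow fields (H, W, 2),
    optionally restricted to a boolean validity mask."""
    err = np.linalg.norm(flow_est - flow_gt, axis=-1)
    return err[valid].mean() if valid is not None else err.mean()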
@mastersthesis{diva2:1592777,
author = {Rudin, Malin},
title = {{Evaluation of Optical Flow for Estimation of Liquid Glass Flow Velocity}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5433--SE}},
year = {2021},
address = {Sweden},
}
When radiologists examine X-rays, it is crucial that they are aware of the laterality of the examined body part. The laterality refers to which side of the body that is considered, e.g. Left and Right. The consequences of a mistake based on information regarding the incorrect laterality could be disastrous. This thesis aims to address this problem by providing a deep neural network model that classifies X-rays based on their laterality.
X-ray images contain markers that are used to indicate the laterality of the image. In this thesis, both a classification model and a detection model have been trained to detect these markers and to identify the laterality. The models have been trained and evaluated on four body parts: knees, feet, hands and shoulders. The images can be divided into three laterality classes: Bilateral, Left and Right.
The model proposed in this thesis is a combination of two classification models: one for distinguishing between Bilateral and Unilateral images, and one for classifying Unilateral images as Left or Right. The latter utilizes the confidence of the predictions to categorize some of them as less accurate (Uncertain), which includes images where the marker is not visible or very hard to identify.
The model was able to correctly distinguish Bilateral from Unilateral images with an accuracy of 100.0 %. For the Unilateral images, 5.00 % were categorized as Uncertain, and 99.99 % of the remaining images were classified correctly as Left or Right.
@mastersthesis{diva2:1587188,
author = {Björn, Martin},
title = {{Laterality Classification of X-Ray Images:
Using Deep Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5417-SE}},
year = {2021},
address = {Sweden},
}
Deep learning methods for medical image segmentation are hindered by the lack of training data. This thesis aims to develop a method that overcomes this problem. A basic U-net trained on XCAT phantom data was tested first. The segmentation results were unsatisfactory even when artificial quantum noise was added. As a workaround, CycleGAN was used to add tissue textures to the XCAT phantom images by analyzing patient CT images. The generated images were used to train the network. The textures introduced by CycleGAN improved the segmentation, but some errors remained. The basic U-net was replaced with Attention U-net, which further improved the segmentation. More work is needed to fine-tune and thoroughly evaluate the method. The results obtained so far demonstrate the potential of this method for the segmentation of medical images. The proposed algorithms may be used in iterative image reconstruction algorithms in multi-energy computed tomography.
@mastersthesis{diva2:1584712,
author = {ZHAO, HANG},
title = {{Segmentation and synthesis of pelvic region CT images via neural networks trained on XCAT phantom data}},
school = {Linköping University},
type = {{}},
year = {2021},
address = {Sweden},
}
This thesis provides a comparison between instance segmentation methods using point clouds and depth images. Specifically, their performance on cluttered scenes of irregular objects in an industrial environment is investigated.
Recent work by Wang et al. [1] has suggested potential benefits of a point cloud representation when performing deep learning on data from 3D cameras. However, little work has been done to enable quantifiable comparisons between methods based on different representations, particularly on industrial data.
Generating synthetic data provides accurate grayscale, depth map, and point cloud representations for a large number of scenes and can thus be used to compare methods regardless of datatype. The datasets in this work are created using a tool provided by SICK. They simulate postal packages on a conveyor belt scanned by a LiDAR, closely resembling a common industry application. Two datasets are generated. One dataset has low complexity, containing only boxes. The other has higher complexity, containing a combination of boxes and multiple types of irregularly shaped parcels.
State-of-the-art instance segmentation methods are selected based on their performance on existing benchmarks. We chose PointGroup by Jiang et al. [2], which uses point clouds, and Mask R-CNN by He et al. [3], which uses images.
The results support that there may be benefits of using a point cloud representation over depth images. PointGroup performs better in terms of the chosen metric on both datasets. On low complexity scenes, the inference times are similar between the two methods tested. However, on higher complexity scenes, Mask R-CNN is significantly faster.
@mastersthesis{diva2:1584003,
author = {Konradsson, Albin and Bohman, Gustav},
title = {{3D Instance Segmentation of Cluttered Scenes:
A Comparative Study of 3D Data Representations}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5421--SE}},
year = {2021},
address = {Sweden},
}
This thesis investigates the possibility of utilizing data from multiple modalities to enable an automated recycling system to separate ferrous from non-ferrous debris. The two methods sensor fusion and hallucinogenic sensor fusion were implemented in a four-step approach of deep CNNs. Sensor fusion implies that multiple modalities are run simultaneously during the operation of the system. The individual outputs are then fused, and the joint performance is expected to be superior to having only one of the sensors. In hallucinogenic sensor fusion, the goal is to achieve the benefits of sensor fusion with respect to cost and complexity even when one of the modalities is removed from the system. This is achieved by leveraging data from a more complex modality onto a simpler one in a student/teacher approach. As a result, the teacher modality will train the student sensor to hallucinate features beyond its visual spectrum. Based on the results of a prestudy involving multiple types of modalities, a hyperspectral sensor was deployed as the teacher to complement a simple RGB camera. Three studies involving differently composed datasets were conducted to evaluate the effectiveness of the methods. The results show that the joint performance of a hyperspectral sensor and an RGB camera is superior to either sensor on its own. It can also be concluded that training a network with hyperspectral images can improve the classification accuracy when operating with only RGB data. However, the addition of a hyperspectral sensor might be considered superfluous, as this report shows that the standardized shapes of industrial debris enable a single RGB camera to achieve an accuracy above 90%. The material used in this thesis can also be concluded to be suboptimal for hyperspectral analysis. Compared to vegetation scenes, only a limited amount of additional data could be obtained by including wavelengths besides the ones representing red, green and blue.
@mastersthesis{diva2:1582328,
author = {Brundin, Sebastian and Gräns, Adam},
title = {{Efficient Recycling Of Non-Ferrous Materials Using Cross-Modal Knowledge Distillation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5403--SE}},
year = {2021},
address = {Sweden},
}
This master thesis studies the learning of dense feature descriptors where camera poses are the only supervisory signal. The use of camera poses as a supervisory signal has only been published once before, and this thesis expands on this previous work by utilizing a couple of different techniques meant to increase the robustness of the method, which is particularly important when not having access to ground-truth correspondences. Firstly, an adaptive robust loss is utilized to better differentiate inliers and outliers. Secondly, statistical properties during training are both enforced and adapted to, in an attempt to alleviate problems with uncertainties introduced by not having true correspondences available. These additions are shown to slightly increase performance, and also highlight some key ideas related to prediction certainty and robustness when working with camera poses as a supervisory signal. Finally, possible directions for future work are discussed.
@mastersthesis{diva2:1573398,
author = {Dahlqvist, Marcus},
title = {{Adaptive Losses for Camera Pose Supervision}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5422--SE}},
year = {2021},
address = {Sweden},
}
Image segmentation through neural networks and deep learning has, in the recent decade, become a successful tool for automated decision-making. For Luossavaara-Kiirunavaara Aktiebolag (LKAB), this means identifying the amount of slag inside a furnace through computer vision.
There are many prominent convolutional neural network architectures in the literature, and this thesis explores two: a modified U-Net and the PSPNet. The architectures were combined with three loss functions and three class weighting schemes resulting in 18 model configurations that were evaluated and compared. This thesis also explores transfer learning techniques for neural networks tasked with identifying slag in images from inside a furnace. The benefit of transfer learning is that the network can learn to find features from already labeled data of another context. Finally, the thesis explored how temporal information could be utilised by adding an LSTM layer to a model taking pairs of images as input, instead of one.
The results show (1) that the PSPNet outperformed the U-Net for all tested configurations in all relevant metrics, (2) that the model is able to find more complex features while converging quicker by using transfer learning, and (3) that utilising temporal information reduced the variance of the predictions, and that the modified PSPNet using an LSTM layer showed promise in handling images with outlying characteristics.
@mastersthesis{diva2:1572304,
author = {von Koch, Christian and Anz\'{e}n, William},
title = {{Detecting Slag Formation with Deep Learning Methods:
An experimental study of different deep learning image segmentation models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5427--SE}},
year = {2021},
address = {Sweden},
}
Detecting and outlining products in images is beneficial for many use cases in e-commerce, such as automatically identifying and locating products within images and proposing matches for the detections. This study investigated how the utilisation of metadata associated with images of products could help boost the performance of an existing approach with the ultimate goal of reducing manual labour needed to annotate images. This thesis explored if approximate pseudo masks could be generated for products in images by leveraging metadata as image-level labels and subsequently using the masks to train a Mask R-CNN. However, this approach did not result in satisfactory results. Further, this study found that by incorporating the metadata directly in the Mask R-CNN, an mAP performance increase of nearly 5% was achieved. Furthermore, utilising the available metadata to divide the training samples for a KNN model into subsets resulted in an increased top-3 accuracy of up to 16%. By representing the data with embeddings created by a pre-trained CNN, the KNN model performed better with both higher accuracy and more reasonable suggestions.
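As a rough sketch of the embedding-plus-KNN idea described above, the snippet below extracts features with a pre-trained ResNet-50 (its final classification layer replaced by an identity) and fits a scikit-learn k-nearest-neighbour model on them. The backbone choice, tensor shapes, and random data are assumptions for illustration, not the thesis' actual setup, and the weights enum requires a recent torchvision.

import numpy as np
import torch
import torchvision.models as models
from sklearn.neighbors import KNeighborsClassifier

# Backbone without the final classification layer -> 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed(images):
    """images: float tensor of shape (N, 3, 224, 224), already normalised."""
    with torch.no_grad():
        return backbone(images).numpy()

# Hypothetical data: train_imgs/train_labels would come from one metadata subset.
train_imgs = torch.randn(32, 3, 224, 224)
train_labels = np.random.randint(0, 4, size=32)
query_imgs = torch.randn(8, 3, 224, 224)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embed(train_imgs), train_labels)
top3_neighbours = knn.kneighbors(embed(query_imgs), return_distance=False)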
@mastersthesis{diva2:1570488,
author = {Wahlquist, Gustav},
title = {{Improving Automatic Image Annotation Using Metadata}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5398--SE}},
year = {2021},
address = {Sweden},
}
Learning-based multi-view stereo (MVS) has shown promising results in the domain of general 3D reconstruction. However, no work before this thesis has applied learning-based MVS to urban 3D reconstruction from satellite images. In this thesis, learning-based MVS is used to infer depth maps from satellite images. Models are trained on both synthetic and real satellite images from Las Vegas with ground truth data from a high-resolution aerial-based 3D model. This thesis also evaluates different methods for reconstructing digital surface models (DSM) and compares them to existing satellite-based 3D models at Maxar Technologies. The DSMs are created by either post-processing point clouds obtained from predicted depth maps or by an end-to-end approach where the depth map for an orthographic satellite image is predicted.
This thesis concludes that learning-based MVS can be used to predict accurate depth maps. Models trained on synthetic data yielded relatively good results, but not nearly as good as models trained on real satellite images. The trained models also generalize relatively well to cities not present in training. This thesis also concludes that the reconstructed DSMs achieve better quantitative results than the existing 3D model in Las Vegas and similar results for the test sets from other cities. Compared to ground truth, the best-performing method achieved L1 and L2 errors 14 % and 29 % lower than Maxar's current 3D model, respectively. The method that uses a point cloud as an intermediate step achieves better quantitative results than the end-to-end system. Very promising qualitative results are achieved with the proposed methods, especially when utilizing an end-to-end approach.
@mastersthesis{diva2:1567722,
author = {Yngesjö, Tim},
title = {{3D Reconstruction from Satellite Imagery Using Deep Learning}},
school = {Linköping University},
type = {{}},
year = {2021},
address = {Sweden},
}
The increasing popularity of drones has made it convenient to capture a large number of images of a property, which can then be used to build a 3D model. The conditions of buildings can be analyzed to plan renovations. This creates an interest in automatically identifying building materials, a task well suited for machine learning.
With access to drone imagery of buildings as well as depth maps and normal maps, we created a dataset for semantic segmentation. Two different convolutional neural networks were trained and evaluated, to see how well they perform material segmentation. DeepLabv3+, which uses RGB data, was compared to Depth-Aware CNN, which uses RGB-D data. Our experiments showed that DeepLabv3+ achieved higher mean intersection over union.
To investigate if the information in the depth maps and normal maps could give a performance boost, we conducted experiments with an encoding we call HMN - horizontal disparity, magnitude of normal with ground, normal parallel with gravity. This three channel encoding was used to jointly train two CNNs, one with RGB and one with HMN, and then sum their predictions. This led to improved results for both DeepLabv3+ and Depth-Aware CNN.
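A minimal sketch of the late-fusion step described above, summing the per-pixel class scores from an RGB stream and an HMN stream before taking the argmax; the tensor shapes and class count are assumptions, and the actual trained networks are omitted.

import torch

def fused_prediction(rgb_logits, hmn_logits):
    """Late fusion: sum the per-pixel class scores from the two streams and
    take the argmax. Both tensors have shape (N, num_classes, H, W)."""
    return (rgb_logits + hmn_logits).argmax(dim=1)

# Hypothetical example with 5 material classes on a 4x4 crop.
rgb_logits = torch.randn(1, 5, 4, 4)
hmn_logits = torch.randn(1, 5, 4, 4)
labels = fused_prediction(rgb_logits, hmn_logits)   # shape (1, 4, 4)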
@mastersthesis{diva2:1567671,
author = {Rydgård, Jonas and Bejgrowicz, Marcus},
title = {{Semantic Segmentation of Building Materials in Real World Images Using 3D Information}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5405--SE}},
year = {2021},
address = {Sweden},
}
Generic visual object tracking is the task of tracking one or several objects in all frames in a video, knowing only the location and size of the target in the initial frame. Visual tracking can be carried out in both the infrared and the visual spectrum simultaneously; this is known as multi-modal tracking. Utilizing both spectra can result in a more diverse tracker, since visual tracking in infrared imagery makes it possible to detect objects even in poor visibility or in complete darkness. However, infrared imagery lacks the level of detail present in visual images. A common method for visual tracking is to use discriminative correlation filters (DCF). These correlation filters are then used to detect an object in every frame of an image sequence. This thesis focuses on investigating aspects of a DCF based tracker operating in the two different modalities, infrared and visual imagery. First, it was investigated whether the tracking benefits from using two channels instead of one and what happens to the tracking result if one of those channels is degraded by an external cause. It was also investigated if the addition of image features can further improve the tracking. The result shows that the tracking improves when using two channels instead of only a single channel. It also shows that utilizing two channels is a good way to create a robust tracker, which is still able to perform even though one of the channels is degraded. Deep features, extracted from a pre-trained convolutional neural network, were the image feature that improved the tracking the most, although the implementation of the deep features made the tracking significantly slower.
@mastersthesis{diva2:1566492,
author = {Wettermark, Emma and Berglund, Linda},
title = {{Multi-Modal Visual Tracking Using Infrared Imagery}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5401--SE}},
year = {2021},
address = {Sweden},
}
This thesis investigates the development and use of software to measure respiratory frequency of cows using optronics and computer vision. It examines mainly two different strategies of image and signal processing and their performance for different input qualities. The effect of heat stress on dairy cows and the high transmission risk of pneumonia for calves make the investigation done during this thesis highly relevant, since both have the same symptom: increased respiratory frequency. The data set used in this thesis consisted of recordings of dairy cows in different environments and from varying angles. Recordings where the authors could determine a true breathing frequency by monitoring body movements were accepted into the data set and used to test and develop the algorithms. One method developed in this thesis estimated the breathing rate in the frequency domain by Fast Fourier Transform and was named "N-point Fast Fourier Transform". The other method was called "Breathing Movement Zero-Crossing Counting". It estimated a signal in the time domain, whose fundamental frequency was determined as the breathing frequency by a zero-crossing algorithm. The results showed that both developed algorithms successfully estimated a breathing frequency with a reasonable error margin for most of the data set. The zero-crossing algorithm showed the most consistent result, with an error margin lower than 0.92 breaths per minute (BPM) for twelve of thirteen recordings. However, it is limited to recordings where the camera is placed above the cow. The N-point FFT algorithm estimated the breathing frequency with error margins between 0.44 and 5.20 BPM for the same recordings as the zero-crossing algorithm. This method is not limited to a specific camera angle but requires the cow to be relatively stationary to get accurate results. It could therefore also be evaluated on the remaining three recordings of the data set, where the error margins were measured between 1.92 and 10.88 BPM. Both methods had execution times acceptable for a real-time implementation. The data set was, however, too incomplete to determine performance for recordings from different optronic devices.
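The two estimation ideas described above can be sketched as follows: picking the dominant FFT peak of a breathing-motion signal, and counting rising zero crossings of the mean-removed signal. The sampling rate, synthetic signal, and function names are illustrative assumptions, not the thesis' implementation.

import numpy as np

def bpm_from_fft(signal, fps):
    """Breaths per minute from the dominant frequency of the motion signal."""
    signal = signal - np.mean(signal)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[np.argmax(spectrum[1:]) + 1] * 60.0   # skip the DC bin

def bpm_from_zero_crossings(signal, fps):
    """Breaths per minute from the number of rising zero crossings."""
    signal = signal - np.mean(signal)
    rising = np.sum((signal[:-1] < 0) & (signal[1:] >= 0))
    duration_min = len(signal) / fps / 60.0
    return rising / duration_min

# Synthetic check: 0.5 Hz breathing motion sampled at 30 fps for 20 s.
fps = 30.0
t = np.arange(0, 20, 1 / fps)
motion = np.sin(2 * np.pi * 0.5 * t) + 0.02 * np.random.randn(len(t))
print(bpm_from_fft(motion, fps), bpm_from_zero_crossings(motion, fps))  # both roughly 30 BPM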
@mastersthesis{diva2:1563490,
author = {Antonsson, Per and Johansson, Jesper},
title = {{Measuring Respiratory Frequency Using Optronics and Computer Vision}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5376--SE}},
year = {2021},
address = {Sweden},
}
Reconstruction of sonar images is an inverse problem, which is normally solved with model-based methods. These methods may introduce undesired artifacts called angular and range leakage into the reconstruction. In this thesis, a method called Learned Primal-Dual Reconstruction, which combines a data-driven and a model-based approach, is used to investigate the use of data-driven methods for reconstruction within sonar imaging. The method uses primal and dual variables inspired by classical optimization methods where parts are replaced by convolutional neural networks to iteratively find a solution to the reconstruction problem. The network is trained and validated with synthetic data on eight models with different architectures and training parameters. The models are evaluated on measurement data and the results are compared with those from a purely model-based method. Reconstructions performed on synthetic data, where a ground truth image is available, show that it is possible to achieve reconstructions with the data-driven method that have less leakage than reconstructions from the model-based method. For reconstructions performed on measurement data where no ground truth is available, some variants of the learned model achieve a good result with less leakage.
@mastersthesis{diva2:1561999,
author = {Nilsson, Lovisa},
title = {{Data-Driven Methods for Sonar Imaging}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5381--SE}},
year = {2021},
address = {Sweden},
}
As photos and videos are increasingly used as evidence material, it is important to know whether they can be trusted as evidence or whether there is a risk that they have been forged. This thesis investigates methods for detecting anomalous regions in images and videos using photo-response non-uniformity -- a fixed-pattern sensor noise that can be estimated from photos or videos.
For photos, experiments were performed on a method that assumes other photos from the same camera are available. For videos, experiments were performed on a method further developed from the still image method, with other videos from the same camera being available. The last experiments were performed on videos when only the video that was about to be investigated was available.
The experiments on the still image method were performed on images with three different kinds of forged regions: a forged region from somewhere else in the same photo, a forged region from a photo taken by another camera, and a forged region from the same sensor position in a photo taken by the same camera. The method should not be able to detect the third kind of forged region. Experiments performed on videos had a forged region in several adjacent frames in the video. The forged region was from another video, and it moved and changed shape between the frames.
The methods mainly consist of a classification process and some post-processing. In the classification process, features were extracted from the images/videos and used in a random forest classifier. The results are presented in terms of precision, recall, F1 score and false positive rate.
The quality of the still images was generally better than that of the videos, which also resulted in better results. For the cameras used in the experiments, it seemed easier to estimate a good PRNU pattern from photos and videos from older cameras, probably due to sensor differences and additional processing in newer camera models. How the images and videos are compressed also affects the possibility of estimating a good PRNU pattern, because important information may then be lost.
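A minimal sketch of the PRNU idea underlying the methods above: a camera fingerprint is estimated by averaging noise residuals from several images, and a query residual is correlated against it. A simple Gaussian filter stands in for the wavelet-based denoisers commonly used in PRNU work, and the synthetic images are assumptions for illustration only.

import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(image, sigma=2.0):
    """Residual = image minus a denoised version (Gaussian here for simplicity)."""
    image = image.astype(float)
    return image - gaussian_filter(image, sigma)

def estimate_fingerprint(images):
    """Average the residuals of several images from the same camera."""
    return np.mean([noise_residual(img) for img in images], axis=0)

def correlation(a, b):
    a = a - a.mean(); b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Synthetic cameras: a fixed multiplicative pattern plus shot noise per image.
rng = np.random.default_rng(0)
prnu = rng.normal(0, 0.02, (64, 64))                       # this camera's pattern
shots = [128 * (1 + prnu) + rng.normal(0, 2, (64, 64)) for _ in range(20)]
fingerprint = estimate_fingerprint(shots)
same_cam = 128 * (1 + prnu) + rng.normal(0, 2, (64, 64))
other_cam = 128 * (1 + rng.normal(0, 0.02, (64, 64))) + rng.normal(0, 2, (64, 64))
print(correlation(noise_residual(same_cam), fingerprint))   # clearly positive
print(correlation(noise_residual(other_cam), fingerprint))  # close to zero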
@mastersthesis{diva2:1552602,
author = {Söderqvist, Kerstin},
title = {{Anomaly Detection in Images and Videos Using Photo-Response Non-Uniformity}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5367--SE}},
year = {2021},
address = {Sweden},
}
Instance segmentation has great potential for improving the current state of littering by autonomously detecting and segmenting different categories of litter. With this information, litter could, for example, be geotagged to aid litter pickers or to give precise locational information to unmanned vehicles for autonomous litter collection. Land-based litter instance segmentation is a relatively unexplored field, and this study aims to give a comparison of the instance segmentation models Mask R-CNN and DetectoRS using the multiclass litter dataset called Trash Annotations in Context (TACO) in conjunction with the Common Objects in Context precision and recall scores. TACO is an imbalanced dataset, and therefore imbalanced data-handling is addressed, exercising a second-order relation iterative stratified split, and additionally oversampling when training Mask R-CNN. Mask R-CNN without oversampling resulted in a segmentation of 0.127 mAP, and with oversampling 0.163 mAP. DetectoRS achieved 0.167 segmentation mAP and improves the segmentation mAP of small objects most noticeably, by a factor of at least 2, which is important within the litter domain since small objects such as cigarettes are overrepresented. In contrast, oversampling with Mask R-CNN does not seem to improve the general precision of small and medium objects, but only improves the detection of large objects. It is concluded that DetectoRS improves results compared to Mask R-CNN, as does oversampling. However, using a dataset that cannot have an all-class representation for train, validation, and test splits, together with an iterative stratification that does not guarantee all-class representations, makes it hard for future works to do exact comparisons to this study. Results are therefore approximate when considering all categories, since 12 categories are missing from the test set, 4 of which were impossible to split into train, validation, and test sets. Further image collection and annotation to mitigate the imbalance would most noticeably improve results, since results depend on class-averaged values. Doing oversampling with DetectoRS would also help improve results. There is also the option to combine the two datasets TACO and MJU-Waste to enforce training of more categories.
@mastersthesis{diva2:1546705,
author = {Sievert, Rolf},
title = {{Instance Segmentation of Multiclass Litter and Imbalanced Dataset Handling:
A Deep Learning Model Comparison}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5365--SE}},
year = {2021},
address = {Sweden},
}
Training data is an essential ingredient within supervised learning, yet time consuming, expensive and for some applications impossible to retrieve. Thus it is of interest to use synthetic training data. However, the domain shift of synthetic data makes it challenging to obtain good results when used as training data for deep learning models. It is therefore of interest to refine synthetic data, e.g. using image-to-image translation, to improve results. The aim of this work is to compare different methods to do image-to-image translation of synthetic training data of thermal IR-images using GANs. Translation is done both using synthetic thermal IR-images alone, as well as including pixelwise depth and/or semantic information. To evaluate, a new measure based on the Fréchet Inception Distance, adapted to work for thermal IR-images, is proposed. The results show that the model trained using IR-images alone translates the generated images closest to the domain of authentic thermal IR-images. The training where IR-images are complemented by corresponding pixelwise depth data performs second best. However, given more training time, inclusion of depth data has the potential to outperform training with IR data alone. This gives a valuable insight on how to best translate images from the domain of synthetic IR-images to that of authentic IR-images, which is vital for quick and low cost generation of training data for deep learning models.
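The core of the proposed measure is the Fréchet distance between Gaussians fitted to two sets of feature activations, as in FID. A minimal sketch is given below; the random placeholder features stand in for embeddings from an Inception-style (or IR-adapted) network, which is an assumption for illustration.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (N, D)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # discard tiny imaginary residue
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, (500, 64))   # placeholder "authentic" features
fake_feats = rng.normal(0.3, 1.2, (500, 64))   # placeholder "translated" features
print(frechet_distance(real_feats, fake_feats))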
@mastersthesis{diva2:1543340,
author = {Hamrell, Hanna},
title = {{Image-to-Image Translation for Improvement of Synthetic Thermal Infrared Training Data Using Generative Adversarial Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5364--SE}},
year = {2021},
address = {Sweden},
}
Perception of depth, ego-motion and robust keypoints is critical for SLAM and structure from motion applications. Neural networks have achieved great performance in perception tasks in recent years. But collecting labeled data for supervised training is labor intensive and costly. This thesis explores recent methods in unsupervised training of neural networks that can predict depth, ego-motion, keypoints and do geometric consensus maximization. The benefit of unsupervised training is that the networks can learn from raw data collected from the camera sensor, instead of labeled data. The thesis focuses on training on images from a monocular camera, where no stereo or LIDAR data is available. The experiments compare different techniques for depth and ego-motion prediction from previous research, and show how the techniques can be combined successfully. A keypoint prediction network is evaluated and its performance is compared with the ORB detector provided by OpenCV. A geometric consensus network is also implemented and its performance is compared with the RANSAC algorithm in OpenCV. The consensus maximization network is trained on the output of the keypoint prediction network. For future work it is suggested that all networks could be combined and trained jointly to reach a better overall performance. The results show (1) which techniques in unsupervised depth prediction are most effective, (2) that the keypoint predicting network outperformed the ORB detector, and (3) that the consensus maximization network was able to classify outliers with comparable performance to the RANSAC algorithm of OpenCV.
@mastersthesis{diva2:1534180,
author = {Örjehag, Erik},
title = {{Unsupervised Learning for Structure from Motion}},
school = {Linköping University},
type = {{LiTH-ISY-EX--21/5361--SE}},
year = {2021},
address = {Sweden},
}
In one of the facilities at the Stena Recycling plant in Halmstad, Sweden, about 300 tonnes of metallic waste is processed each day with the aim of sorting out all non-ferrous material. At the end of this process, non-ferrous materials are manually sorted out from the ferrous materials. This thesis investigates a computer vision based approach to identify and localize the non-ferrous materials and eventually automate the sorting.
Images were captured of ferrous and non-ferrous materials. The images are processed and segmented to be used as annotation data for a deep convolutional neural segmentation network. Network models have been trained on different kinds and amounts of data. The resulting models are evaluated and tested in accordance with different evaluation metrics. Methods of creating advanced training data by merging imaging information were tested. Experiments with using classifier prediction confidence to identify objects of unknown classes were performed.
This thesis shows that it is possible to discern ferrous from non-ferrous material with a purely vision based system. The thesis also shows that it is possible to automatically create annotated training data. It becomes evident that it is possible to create better training data, tailored for the task at hand, by merging image data. A segmentation network trained on more than two classes yields lower prediction confidence for objects unknown to the classifier.
Substituting manual sorting with a purely vision based system seems like a viable approach. Before a substitution is considered, the automatic system needs to be evaluated in comparison to the manual sorting.
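The prediction-confidence idea mentioned above can be sketched as a simple softmax threshold: objects whose top class probability falls below a chosen value are flagged as potentially unknown. The threshold and the toy scores are assumptions for illustration, not the thesis' actual values.

import numpy as np

def flag_unknown(logits, threshold=0.6):
    """Softmax over class scores; objects whose top probability falls below
    the threshold are flagged as potentially belonging to an unknown class."""
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = exp / exp.sum(axis=-1, keepdims=True)
    top = probs.max(axis=-1)
    return probs.argmax(axis=-1), top < threshold

scores = np.array([[4.0, 0.5, 0.2],    # confident prediction
                   [1.1, 1.0, 0.9]])   # uncertain -> likely unknown object
labels, unknown = flag_unknown(scores)
print(labels, unknown)                 # [0 0] [False  True]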
@mastersthesis{diva2:1552630,
author = {Almin, Fredrik},
title = {{Detection of Non-Ferrous Materials with Computer Vision}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5321--SE}},
year = {2020},
address = {Sweden},
}
3D reconstruction can be used in forensic science to reconstruct crime scenes and objects so that measurements and further information can be acquired off-site. It is desirable to use image based reconstruction methods, but there is currently no procedure available for determining the uncertainty of such reconstructions. In this thesis the uncertainty of Structure from Motion is investigated. This is done by exploring the literature available on the subject and compiling the relevant information in a literature summary. Also, Monte Carlo simulations are conducted to study how the feature position uncertainty affects the uncertainty of the parameters estimated by bundle adjustment.
The experimental results show that poses of cameras that contain few image correspondences are estimated with higher uncertainty. The poses of such cameras are estimated with lesser uncertainty if they have feature correspondences in cameras that contain a higher number of projections.
@mastersthesis{diva2:1499090,
author = {Lindberg, Mimmi},
title = {{Forensic Validation of 3D models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5346--SE}},
year = {2020},
address = {Sweden},
}
CNN-based (Convolutional Neural Network) visual object detectors often reach human level of accuracy but need to be trained with large amounts of manually annotated data. Collecting and annotating this data can frequently be time-consuming and financially expensive. Using generative models to augment the data can help minimize the amount of data required and increase detection performance. Many state-of-the-art generative models are Generative Adversarial Networks (GANs). This thesis investigates if and how one can utilize image data to generate new data through GANs to train a YOLO-based (You Only Look Once) object detector, and how CAD (Computer-Aided Design) models can aid in this process.
In the experiments, different models of GANs are trained and evaluated by visual inspection or with the Fréchet Inception Distance (FID) metric. The data provided by Ericsson Research consists of images of antenna and baseband equipment along with annotations and segmentations. Ericsson Research supplied the YOLO detector, and no modifications are made to this detector. Finally, the YOLO detector is trained on data generated by the chosen model and evaluated by the Average Precision (AP).
The results show that the generative models designed in this work can produce RGB images of high quality. However, the quality reduces if binary segmentation masks are to be generated as well. The experiments with CAD input data did not result in images that could be used for the training of the detector.
The GAN designed in this work is able to successfully replace objects in images with the style of other objects. The results show that training the YOLO detector with GAN-modified data compared to training with real data leads to the same detection performance. The results also show that the shapes and backgrounds of the antennas contributed more to detection performance than their style and colour.
@mastersthesis{diva2:1484523,
author = {Thaung, Ludwig},
title = {{Advanced Data Augmentation:
With Generative Adversarial Networks and Computer-Aided Design}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5340--SE}},
year = {2020},
address = {Sweden},
}
Forged videos of swapped faces, so-called deepfakes, have gained a lot of attention in recent years. Methods for automated detection of this type of manipulation are also seeing rapid progress in their development. The purpose of this thesis work is to evaluate the possibility and effectiveness of using deep embeddings from facial recognition networks as a base for detection of such deepfakes. In addition, the thesis aims to answer whether or not the identity embeddings contain information that can be used for detection when analyzed over time, and if it is suitable to include information about the person's head pose in this analysis. To answer these questions, three classifiers are created with the intent to answer one question each. Their performances are compared with each other and it is shown that identity embeddings are suitable as a basis for deepfake detection. Temporal analysis of the embeddings also seems effective, at least for deepfake methods that only work on a frame-by-frame basis. Including information about head poses in the videos is shown not to improve such a classifier.
@mastersthesis{diva2:1476999,
author = {Emir, Alkazhami},
title = {{Facial Identity Embeddings for Deepfake Detection in Videos}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5341--SE}},
year = {2020},
address = {Sweden},
}
Multi-pedestrian tracking (MPT) is the task of localizing and following the trajectory of pedestrians in a sequence. Using an MPT algorithm is an important part in preventing pedestrian-vehicle collisions in Automated Driving (AD) and Advanced Driving Assistance Systems (ADAS). It has benefited greatly from the advances in computer vision and machine learning in the last decades. Using a pedestrian detector, the tracking consists of associating the detections between frames and maintaining pedestrian identities throughout the sequence. This can be a challenging task due to occlusions, missed detections and complex scenes. The number of pedestrians is unknown, and it varies with time. Finding new methods for improving MPT is an active research field and there are many approaches found in the literature. This work focuses on improving the detection-to-track association, the data association, with the help of extracted color features for each pedestrian. Utilizing the recent improvements in object detection, this work shows that classical color features are still relevant in pedestrian tracking for real-time applications with limited computational resources. The appearance is not only used in the data association but also integrated in a newly proposed method to avoid tracking errors due to missed detections. The results show that even with simple models the color appearance can be used to improve the tracking results. Evaluation on the commonly used Multi-Object Tracking benchmark shows an improvement in the Multi-Object Tracking Accuracy and identity switches, while keeping other measures essentially unchanged.
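As a sketch of using classical colour features in the data association, the snippet below computes a normalised colour histogram per detection crop and compares histograms with the Bhattacharyya distance as an appearance cost; the bin count, crop sizes, and colours are illustrative assumptions, not the thesis' exact features.

import cv2
import numpy as np

def colour_descriptor(bgr_patch, bins=8):
    """Normalised 3D colour histogram of a detection crop."""
    hist = cv2.calcHist([bgr_patch], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def appearance_cost(desc_track, desc_detection):
    """Bhattacharyya distance in [0, 1]; lower means more similar appearance."""
    return cv2.compareHist(desc_track.astype(np.float32),
                           desc_detection.astype(np.float32),
                           cv2.HISTCMP_BHATTACHARYYA)

# Hypothetical crops of the same pedestrian in two consecutive frames.
crop_t0 = np.full((64, 32, 3), (30, 60, 200), dtype=np.uint8)
crop_t1 = np.full((64, 32, 3), (32, 58, 205), dtype=np.uint8)
print(appearance_cost(colour_descriptor(crop_t0), colour_descriptor(crop_t1)))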
@mastersthesis{diva2:1467160,
author = {Flodin, Frida},
title = {{Improved Data Association for Multi-Pedestrian Tracking Using Image Information}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5329--SE}},
year = {2020},
address = {Sweden},
}
The task of 6D pose estimation with deep learning is to train networks to, from an image of an object, determine the rotation and translation of the object. Impressive results have recently been shown in deep learning based 6D pose estimation. However, many current solutions rely on real-world data when training, which, as opposed to synthetic data, requires time-consuming annotation. In this thesis, we introduce a pipeline for generating synthetic ground truth data for deep 6D pose estimation, where annotation is done automatically. With a 3D CAD model, we use Blender to render 2D images of the model from different view points. We also create all other relevant data needed for pose estimation, e.g., the poses of an object, mask images and 3D keypoints on the object. Using this pipeline, it is possible to adjust different settings to reduce the domain gap between synthetic data and real-world data and get better pose estimation results. Such settings could be changing the method of extracting 3D keypoints and varying the scale of the object or the light settings in the scene. The network used to test the performance of training on our synthetic data is PVNet, which achieves state-of-the-art results for 6D pose estimation. This architecture learns to find 2D keypoints of the object in the image, as well as 2D-3D keypoint correspondences. With these correspondences, the Perspective-n-Point (PnP) algorithm is used to extract a pose. We evaluate the pose estimation of the different settings on the synthetic data and compare these results to other state-of-the-art work. We find that using only real-world data for training is worse than using a combination of synthetic and real-world data. Several other findings are that varying scale and lighting, in addition to adding random background images to the rendered images, improves results. Four different novel keypoint selection methods are introduced in this work and tried against methods used in previous work. We observe that our methods achieve similar or better results. Finally, we use the best possible settings from the synthetic data pipeline, but with memory limitations on the amount of training data. We are close to state-of-the-art results, and could get closer with more data.
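A minimal sketch of the PnP step mentioned above, recovering a pose from 2D-3D keypoint correspondences with OpenCV's solvePnP; the 3D keypoints, camera intrinsics, and ground-truth pose are made up solely to synthesise the 2D points for the example.

import cv2
import numpy as np

# Hypothetical 3D keypoints on the object (metres) and camera intrinsics.
object_points = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0, 0, 0.1],
                          [0.1, 0.1, 0], [0.1, 0, 0.1]], dtype=np.float64)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

# Ground-truth pose used only to synthesise 2D detections for this example.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.5])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
print(ok, rvec.ravel(), tvec.ravel())   # should recover the ground-truth pose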
@mastersthesis{diva2:1467210,
author = {Löfgren, Tobias and Jonsson, Daniel},
title = {{Generating Synthetic Data for Evaluation and Improvement of Deep 6D Pose Estimation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5339--SE}},
year = {2020},
address = {Sweden},
}
Light Detection and Ranging (LiDAR) sensors have many different application areas, from revealing archaeological structures to aiding navigation of vehicles. However, it is challenging to interpret and fully use the vast amount of unstructured data that LiDARs collect. Automatic classification of LiDAR data would ease the utilization, whether it is for examining structures or aiding vehicles.
In recent years, there have been many advances in deep learning for semantic segmentation of automotive LiDAR data, but there is less research on aerial LiDAR data. This thesis investigates the current state-of-the-art deep learning architectures, and how well they perform on LiDAR data acquired by an Unmanned Aerial Vehicle (UAV). It also investigates different training techniques for class imbalanced and limited datasets, which are common challenges for semantic segmentation networks. Lastly, this thesis investigates if pre-training can improve the performance of the models.
The LiDAR scans were first projected to range images and then a fully convolutional semantic segmentation network was used. Three different training techniques were evaluated: weighted sampling, data augmentation, and grouping of classes. No improvement was observed from the weighted sampling, nor did grouping of classes have a substantial effect on the performance. Pre-training on the large public dataset SemanticKITTI resulted in a small performance improvement, but the data augmentation seemed to have the largest positive impact. The mIoU of the best model, which was trained with data augmentation, was 63.7% and it performed very well on the classes Ground, Vegetation, and Vehicle. The other classes in the UAV dataset, Person and Structure, had very little data and were challenging for most models to classify correctly. In general, the models trained on UAV data performed similarly to the state-of-the-art models trained on automotive data.
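A minimal sketch of the range-image projection used before the segmentation network: each LiDAR point is mapped to a pixel via its yaw and pitch angles, and the pixel stores the range. The image size and vertical field of view are assumptions (typical spinning-LiDAR values), not the thesis' exact settings.

import numpy as np

def spherical_projection(points, height=64, width=1024,
                         fov_up_deg=15.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud to an (H, W) range image.
    Each pixel stores the range of the point falling into it (0 = empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                    # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = 0.5 * (1.0 - yaw / np.pi) * width                     # column index
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * height  # row index
    u = np.clip(np.floor(u), 0, width - 1).astype(int)
    v = np.clip(np.floor(v), 0, height - 1).astype(int)
    image = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-r)              # write closest points last so they win
    image[v[order], u[order]] = r[order]
    return image

cloud = np.random.uniform(-20, 20, size=(10000, 3))   # hypothetical scan
range_image = spherical_projection(cloud)              # shape (64, 1024)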
@mastersthesis{diva2:1459609,
author = {Serra, Sabina},
title = {{Deep Learning for Semantic Segmentation of 3D Point Clouds from an Airborne LiDAR}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5331--SE}},
year = {2020},
address = {Sweden},
}
Autonomous cars are now becoming a reality, but there are still technical hurdles that need to be overcome for the technology to be safe and reliable. One of these issues is the cars' ability to estimate braking distances. This function relies heavily on one parameter: friction. Friction is difficult to estimate for a car since the friction coefficient is dependent on both surfaces in contact - the tires and the road. This thesis presents a novel approach to the problem using a neural network classifier trained on features extracted from images of the road. One major advantage the presented method has over the few existing conventional methods is the ability to estimate friction on road segments ahead of the vehicle. This gives the vehicle time to slow down while the friction is still sufficient. The estimation pipeline performs significantly better than the baseline methods explored in the thesis and provides satisfying results, which demonstrates its potential.
@mastersthesis{diva2:1454043,
author = {Svensson, Erik},
title = {{Transfer Learning for Friction Estimation:
Using Deep Reduced Features}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5312--SE}},
year = {2020},
address = {Sweden},
}
In digital image correlation, an optical full-field analysis method that can determine displacements of an object under load, high-resolution images are preferable. One way to improve the resolution is to improve the camera hardware. This can be expensive, hence another way to enhance the image is to use various image processing techniques to increase the resolution of the image. There are several ways of doing this, and these techniques are called super-resolution. In this thesis, the theory behind several different approaches to super-resolution is presented and discussed. The goal of this thesis has been to investigate if super-resolution is possible in a scene with moving objects as well as movement of the camera. It became clear early on that image registration, a step in many super-resolution methods that will be explained in this thesis, was of utmost importance, and a major part of the work became comparing image registration methods. Data has been recorded and then two different super-resolution algorithms have been evaluated on a data set, showing that super-resolution is possible.
@mastersthesis{diva2:1450740,
author = {Dahlström, Erik},
title = {{Super-Resolution Using Dynamic Cameras}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5315--SE}},
year = {2020},
address = {Sweden},
}
Recent improvements in pose estimation have opened up the possibility of new areas of application. One of them is gait recognition, the task of identifying persons based on their unique style of walking, which is increasingly being recognized as an important method of biometric identification. This thesis has explored the possibilities of using a pose estimation system, OpenPose, together with deep Recurrent Neural Networks (RNNs) in order to see if there is sufficient information in sequences of 2D poses to use for gait recognition. For this to be possible, a new multi-camera dataset consisting of persons walking on a treadmill was gathered, dubbed the FOI dataset. The results show that this approach has some promise. It achieved an overall classification accuracy of 95.5 % on classes it had seen during training and 83.8 % for classes it had not seen during training. It was, however, unable to recognize sequences from angles it had not seen during training. For that to be possible, more data pre-processing will likely be required.
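In the spirit of the RNN approach above, the sketch below classifies sequences of 2D poses with a single LSTM layer followed by a linear head; the 25 keypoints per frame follow OpenPose's body model, while the hidden size, number of identities, and random input are assumptions for illustration.

import torch
import torch.nn as nn

class GaitLSTM(nn.Module):
    """Classify a walking sequence of 2D poses into one of num_persons identities."""
    def __init__(self, num_keypoints=25, hidden=128, num_persons=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_keypoints * 2,
                            hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_persons)

    def forward(self, poses):            # poses: (batch, frames, 25, 2)
        b, t = poses.shape[:2]
        feats, _ = self.lstm(poses.reshape(b, t, -1))
        return self.head(feats[:, -1])   # logits from the last time step

model = GaitLSTM()
sequence = torch.randn(4, 60, 25, 2)     # 4 sequences of 60 frames each
logits = model(sequence)                 # shape (4, 10)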
@mastersthesis{diva2:1447593,
author = {Persson, Martin},
title = {{Automatic Gait Recognition:
using deep metric learning}},
school = {Linköping University},
type = {{LIU-ISY/LITH-EX-A--20/5316--SE}},
year = {2020},
address = {Sweden},
}
Object detection is a classical computer vision task, encountered in many practical applications such as robotics and autonomous driving. The latter involves serious consequences of failure and a multitude of challenging demands, including high computational efficiency and detection accuracy. Distant objects are notably difficult to detect accurately due to their small scale in the image, consisting of only a few pixels. This is especially problematic in autonomous driving, as objects should be detected at the earliest possible stage to facilitate handling of hazardous situations. Previous work has addressed small objects via use of feature pyramids and super-resolution techniques, but the efficiency of such methods is limited as computational cost increases with image resolution. Therefore, a trade-off must be made between accuracy and cost. Opportunely though, a common characteristic of driving scenarios is the predominance of distant objects in the centre of the image. Thus, the full-frame image can be downsampled to reduce computational cost, and a crop can be extracted from the image centre to preserve resolution for distant vehicles. In this way, short- and long-range images are generated. This thesis investigates the fusion of such images in a convolutional neural network, particularly the fusion level, fusion operation, and spatial alignment. A novel framework — DetSLR — is proposed for the task and examined via the aforementioned aspects. Through adoption of the framework for the well-established SSD detector and MobileNetV2 feature extractor, it is shown that the framework significantly improves upon the original detector without incurring additional cost. The fusion level is shown to have great impact on the performance of the framework, favouring high-level fusion, while only insignificant differences exist between investigated fusion operations. Finally, spatial alignment of features is demonstrated to be a crucial component of the framework.
@mastersthesis{diva2:1447580,
author = {Luusua, Emil},
title = {{Vehicle Detection, at a Distance:
Done Efficiently via Fusion of Short- and Long-Range Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5328--SE}},
year = {2020},
address = {Sweden},
}
In this thesis capsule networks are investigated, both theoretically and empirically. The properties of the dynamic routing [42] algorithm proposed for capsule networks, as well as a routing algorithm in a follow-up paper by Wang et al. [50] are thoroughly investigated. It is conjectured that there are three key attributes that are needed for a good routing algorithm, and these attributes are then related to previous algorithms. A novel routing algorithm EntMin is proposed based on the observations from the investigation of previous algorithms. A thorough evaluation of the performance of different aspects of capsule networks is conducted, and it is shown that EntMin outperforms both dynamic routing and Wang routing. Finally, a capsule network using EntMin routing is compared to a very deep Convolutional Neural Network and it is shown that it achieves comparable performance.
@mastersthesis{diva2:1445181,
author = {Edstedt, Johan},
title = {{Towards Understanding Capsule Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5309--SE}},
year = {2020},
address = {Sweden},
}
This thesis investigates the use of Generative Adversarial Networks (GANs) for detecting images containing non-natural objects in natural environments and if the introduction of stereo data can improve the performance. The state-of-the-art GAN-based anomaly detection method presented by A. Berg et al. in [5] (BergGAN) was the base of this thesis. By modifying BergGAN to not only accept three channel input, but also four and six channel input, it was possible to investigate the effect of introducing stereo data in the method. The input to the four channel network was an RGB image and its corresponding disparity map, and the input to the six channel network was a stereo pair consisting of two RGB images. The three datasets used in the thesis were constructed from a dataset of aerial video sequences provided by SAAB Dynamics, where the scene was mostly wooded areas. The datasets were divided into training and validation data, where the latter was used for the performance evaluation of the respective network. The evaluation method suggested in [5] was used in the thesis, where each sample was scored on the likelihood of it containing anomalies; Receiver Operating Characteristics (ROC) analysis was then applied and the area under the ROC curve was calculated. The results showed that BergGAN was successfully able to detect images containing non-natural objects in natural environments using the dataset provided by SAAB Dynamics. The adaptation of BergGAN to also accept four and six input channels increased the performance of the method, showing that there is information in stereo data that is relevant for GAN-based anomaly detection. There was, however, no substantial performance difference between the network trained with two RGB images and the one trained with an RGB image and its corresponding disparity map.
@mastersthesis{diva2:1442532,
author = {Gehlin, Nils and Antonsson, Martin},
title = {{Detecting Non-Natural Objects in a Natural Environment using Generative Adversarial Networks with Stereo Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5324--SE}},
year = {2020},
address = {Sweden},
}
The field of autonomous driving is as active as it has ever been, but the reality where an autonomous vehicle can drive on all roads is currently decades away. Instead, using an on-the-fly learning method, such as qHebb learning, a system can, after some demonstration, learn the appearance of any road and take over the steering wheel. By training in a simulator, the amount and variation of training can increase substantially; however, an on-rails auto-pilot does not sufficiently populate the learning space of such a model. This study aims to explore concepts that can increase the variance in the training data whilst the vehicle trains online. Three computationally light concepts are proposed, each of which results in a model that can navigate through a simple environment, thus performing better than a model trained solely on the auto-pilot. The most noteworthy approach uses multiple thresholds to detect when the vehicle deviates too much and replicates the action of a human correcting its trajectory. After training on less than 300 frames, a vehicle successfully completed the full test environment using this method.
@mastersthesis{diva2:1444702,
author = {Kindstedt, Mathias},
title = {{Exploring the Training Data for Online Learning of Autonomous Driving in a Simulated Environment}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5325--SE}},
year = {2020},
address = {Sweden},
}
Automatic Face Recognition (AFR) can be useful in the forensic field when identifying people in surveillance footage. In AFR systems it is common to use deep neural networks, which perform well if the quality of the images maintains a certain level. This is a problem when applying AFR to surveillance data, since the quality of those images can be very poor. In this thesis the CNN FaceNet has been used to evaluate how different quality parameters influence the accuracy of the face recognition. The goal is to be able to draw conclusions about how to improve the recognition by using and avoiding certain parameters based on the conditions. Parameters that have been experimented with are angle of the face, image quality, occlusion, colour and lighting. This has been achieved by using datasets with different properties or by altering the images. The parameters are meant to simulate different situations that can occur in surveillance footage that are difficult for the network to recognise. Three different models have been evaluated with different amounts of embeddings and different training data. The results show that the two models trained on the VGGFace2 dataset perform much better than the one trained on CASIA-WebFace. All models' performance drops on images with low quality compared to images with high quality, because the training data includes mostly high-quality images. In some cases, the recognition results can be improved by applying some alterations to the images. This could be by using one frontal and one profile image when trying to identify a person, or by occluding parts of the shape of the face if it gets recognized as other persons with similar face shapes. One main improvement would be to extend the training datasets with more low-quality images. To some extent, this could be achieved by different kinds of data augmentation like artificial occlusion and down-sampled images.
@mastersthesis{diva2:1444005,
author = {Tuvskog, Johanna},
title = {{Evaluation of Face Recognition Accuracy in Surveillance Video}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5302--SE}},
year = {2020},
address = {Sweden},
}
The main result of this thesis is a deep learning model named BearNet, which can be trained to detect an arbitrary amount of objects as a set of points. The model is trained using the Weighted Hausdorff distance as loss function. BearNet has been applied and tested on two problems from the industry. These are:
- From an intensity image, detect two pocket points of an EU-pallet which an autonomous forklift could utilize when determining where to insert its forks.
- From a depth image, detect the start, bend and end points of a straw attached to a juice package, in order to help determine if the straw has been attached correctly.
In the development process of BearNet I took inspiration from the designs of U-Net, UNet++ and a high resolution network named HRNet. Further, I used a dataset containing RGB-images from a surveillance camera located inside a mall, on which the aim was to detect head positions of all pedestrians. In an attempt to reproduce a result from another study, I found that the mall dataset suffers from training set contamination when a model is trained, validated, and tested on it with random sampling. Hence, I propose that the mall dataset is evaluated with a sequential data split strategy, to limit the problem.
I found that the BearNet architecture is well suited for both the EU-pallet and straw datasets, and that it can be successfully used on either RGB, intensity or depth images. On the EU-pallet and straw datasets, BearNet consistently produces point estimates within five and six pixels of ground truth, respectively. I also show that the straw dataset only constitutes a small subset of all the challenges that exist in the problem domain related to the attachment of a straw to a juice package, and that one therefore cannot train a robust deep learning model on it. As an example of this, models trained on the straw dataset cannot correctly handle samples in which there is no straw visible.
@mastersthesis{diva2:1442869,
author = {Runow, Björn},
title = {{Deep Learning for Point Detection in Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5295--SE}},
year = {2020},
address = {Sweden},
}
The process of locating moving objects through video sequences is a fundamental computer vision problem. This process is referred to as video tracking and has a broad range of applications. Even though video tracking is an open research topic that has received much attention during recent years, developing accurate and robust algorithms that can handle complicated tracking tasks and scenes is still challenging. One challenge in computer vision is to develop systems that, like humans, can understand, interpret and recognize visual information in different situations.
In this master thesis work, a tracking algorithm based on eye tracking data is proposed. The aim was to compare the tracking performance of the proposed algorithm with a state-of-the-art video tracker. The algorithm was tested on gaze signals from five participants, recorded with an eye tracker while the participants were exposed to dynamic stimuli. The stimuli were moving objects displayed on a stationary computer screen. The proposed algorithm works offline, meaning that all data is collected before analysis.
The results show that the overall performance of the proposed eye tracking algorithm is comparable to the performance of a state-of-the-art video tracker. The main weaknesses are low accuracy for the proposed eye tracking algorithm and handling of occlusion for the video tracker. We also suggest a method for using eye tracking as a complement to object tracking methods. The results show that the eye tracker can be used in some situations to improve the tracking result of the video tracker. The proposed algorithm can be used to help the video tracker redetect objects that have been occluded or for some other reason are not detected correctly. However, ATOM achieves higher accuracy.
@mastersthesis{diva2:1435385,
author = {Ejnestrand, Ida and Jakobsson, Linn\'{e}a},
title = {{Object Tracking based on Eye Tracking Data:
A comparison with a state-of-the-art video tracker}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5294--SE}},
year = {2020},
address = {Sweden},
}
Previously well-aligned image sensors, mounted on the same camera, may become misaligned due to external vibrations. It is of interest to automatically detect and correct for this misalignment, and to separate the deviation into pointing and/or parallax errors. Two methods were evaluated for this purpose: an area-based image registration method and a feature-based image registration method. In the area-based method, normalized cross-correlation was used to estimate translation parameters. In the feature-based method, SIFT or LIOP descriptors were used to extract features that were matched between the two image modalities to estimate transformation parameters. In both methods, only image points that were in focus were extracted, to avoid detection of false alignment deviations. The results indicate that the area-based image registration method has the potential to automatically detect and correct for an alignment deviation. Moreover, the area-based method showed potential to separate the deviation into pointing errors and parallax errors. The feature-based method was limited to specific scenes but could be used as a complement to the area-based method in order to additionally correct for rotation and/or scaling.
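To make the area-based idea concrete, the sketch below estimates an integer translation between two single-channel images by locating the peak of a zero-mean, FFT-based cross-correlation, a simplified stand-in for the normalized cross-correlation used in the thesis; the masking of out-of-focus points is omitted and the function name is illustrative.

```python
import numpy as np
from scipy.signal import fftconvolve

def estimate_translation(reference, target):
    """Estimate an integer (dy, dx) translation between two grayscale images
    from the peak of their zero-mean cross-correlation. Simplified sketch:
    no per-window normalization and no focus masking; the sign convention
    and a possible one-pixel offset for even image sizes depend on the setup."""
    ref = reference - reference.mean()
    tgt = target - target.mean()
    corr = fftconvolve(ref, tgt[::-1, ::-1], mode="same")    # cross-correlation
    corr /= np.linalg.norm(ref) * np.linalg.norm(tgt) + 1e-12
    peak = np.array(np.unravel_index(np.argmax(corr), corr.shape))
    center = np.array(corr.shape) // 2
    return center - peak                                      # approximate shift
```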
@mastersthesis{diva2:1434095,
author = {Bjerwe, Ida},
title = {{Automatic Alignment Detection and Correction in Infrared and Visual Image Pairs}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5292--SE}},
year = {2020},
address = {Sweden},
}
In today’s society, we experience an increasing challenge to provide healthcare to everyone in need due to the increasing number of patients and the shortage of medical staff. Computers have contributed to mitigating this challenge by offloading the medical staff from some of the tasks. With the rise of deep learning, countless new possibilities have opened up to help the medical staff even further. One domain where deep learning can be applied is the analysis of ultrasound images. In this thesis we investigate the problem of classifying standard views of the heart in ultrasound images with the help of deep learning. We conduct mainly three experiments. First, we use NasNet mobile, InceptionV3, VGG16 and MobileNet, pre-trained on ImageNet, and fine-tune them to ultrasound heart images. We compare the accuracy of these networks to each other and to the baseline model, a CNN that was proposed in [23]. Then we assess a neural network’s capability to generalize to images from ultrasound machines that the network is not trained on. Lastly, we test how the performance of the networks degrades with decreasing amounts of training data. Our first experiment shows that all networks considered in this study have very similar performance in terms of accuracy, with InceptionV3 being slightly better than the rest. The best performance is achieved when the whole network is fine-tuned to our problem instead of fine-tuning only a part of it, while gradually unlocking more layers for training. The generalization experiment shows that neural networks have the potential to generalize to images from ultrasound machines that they are not trained on. It also shows that having a mix of multiple ultrasound machines in the training data increases generalization performance. In our last experiment we compare the performance of the CNN proposed in [23] with MobileNet pre-trained on ImageNet and MobileNet randomly initialized. This shows that the performance of the baseline model suffers the least with decreasing amounts of training data and that pre-training helps performance drastically on smaller training datasets.
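As a rough illustration of this kind of fine-tuning setup, the Keras sketch below attaches a new classification head to an ImageNet-pretrained InceptionV3 backbone and freezes the earliest layers; the layer index, input size, optimizer and learning rate are illustrative assumptions rather than the thesis's actual configuration.

```python
import tensorflow as tf

def build_finetune_model(num_classes, unlock_from=100):
    """ImageNet-pretrained InceptionV3 backbone with a new softmax head.
    Layers with index below `unlock_from` stay frozen; lowering this index
    corresponds to gradually unlocking more layers for training."""
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights="imagenet", input_shape=(299, 299, 3))
    for layer in base.layers[:unlock_from]:
        layer.trainable = False                    # keep early layers frozen
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```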
@mastersthesis{diva2:1425635,
author = {Pop, David},
title = {{Classification of Heart Views in Ultrasound Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5288--SE}},
year = {2020},
address = {Sweden},
}
When creating a photo-realistic 3D model of the world using satellite imagery, image classification is an important part of the process. In this thesis the specific part of automated building extraction is investigated. This is done by investigating the difference in performance between instance segmentation and semantic segmentation for extraction of building footprints in orthorectified imagery. Semantic segmentation of the images is solved by using U-net, a Fully Convolutional Network that outputs a pixel-wise segmentation of the image. Instance segmentation of the images is done by a network called Mask R-CNN. The performance of the models is measured using precision, recall and the F1 score, which is the harmonic mean of precision and recall. The resulting F1 scores of the two methods are similar, with U-net achieving an F1 score of 0.684 without any post-processing and Mask R-CNN achieving an F1 score of 0.676 without post-processing.
@mastersthesis{diva2:1417200,
author = {Fritz, Karin},
title = {{Instance Segmentation of Buildings in Satellite Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5283--SE}},
year = {2020},
address = {Sweden},
}
The performance of conventional deep neural networks tends to degrade when a domain shift is introduced, such as collecting data from a new site. Model-Agnostic Meta-Learning, or MAML, has achieved state-of-the-art performance in few-shot learning by finding initial parameters that adapt easily for new tasks.
This thesis studies MAML in a digital pathology setting. Experiments show that a conventional model generalises poorly to data collected from another site. By annotating a few samples during inference however, a model with initial parameters obtained through MAML training can adapt to achieve better generalisation performance. It is also demonstrated that a simple transfer learning approach using a kNN classifier on features extracted from a conventional model yields good generalisation, but the variance caused by random sampling is higher.
The results indicate that meta learning can lead to a lower annotation effort for machine learning in digital pathology while maintaining accuracy.
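For reference, the PyTorch sketch below shows a first-order MAML step (a common simplification of full MAML that drops second-order derivatives): each task adapts a copy of the shared initialization on its support set, and the query-set gradients of the adapted copies are averaged into the meta-update. The task structure, inner learning rate and single inner step are assumptions for illustration, not the thesis's exact training loop.

```python
import copy
import torch

def fomaml_step(model, loss_fn, tasks, meta_opt, inner_lr=1e-2):
    """One first-order MAML meta-update. `tasks` is a list of
    ((x_support, y_support), (x_query, y_query)) tuples."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for (x_s, y_s), (x_q, y_q) in tasks:
        learner = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        # Inner loop: one adaptation step on the support set.
        inner_opt.zero_grad()
        loss_fn(learner(x_s), y_s).backward()
        inner_opt.step()
        # Outer loss: evaluate the adapted learner on the query set.
        learner.zero_grad()
        loss_fn(learner(x_q), y_q).backward()
        for g, p in zip(meta_grads, learner.parameters()):
            g += p.grad / len(tasks)
    # Apply the averaged query gradients to the shared initialization.
    meta_opt.zero_grad()
    for p, g in zip(model.parameters(), meta_grads):
        p.grad = g
    meta_opt.step()
```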
@mastersthesis{diva2:1414984,
author = {Fagerblom, Freja},
title = {{Model-Agnostic Meta-Learning for Digital Pathology}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5284--SE}},
year = {2020},
address = {Sweden},
}
Classifying clothing attributes in surveillance images can be useful in the forensic field, making it easier to, for example, find suspects based on eyewitness accounts. Deep Neural Networks are often used successfully in image classification, but require a large amount of annotated data. Since labeling data can be time consuming or difficult, and it is easier to get hold of labeled fashion images, this thesis investigates how the domain shift from a fashion domain to a surveillance domain, with little or no annotated data, affects a classifier.
In the experiments, two deep networks of different depth are used as a base and trained on only fashion images as well as both labeled and unlabeled surveillance images, with and without domain adaptation regularizers. The surveillance dataset is new and consists of images that were collected from different surveillance cameras and annotated during this thesis work.
The results show that there is a degradation in performance for a classifier trained on the fashion domain when tested on the surveillance domain, compared to when tested on the fashion domain. The results also show that if no labeled data in the surveillance domain is used for these experiments, it is more effective to use the deeper network and train it on only fashion data, rather than to use the more complicated unsupervised domain adaptation method.
@mastersthesis{diva2:1392992,
author = {Härnström, Denise},
title = {{Classification of Clothing Attributes Across Domains}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5276--SE}},
year = {2020},
address = {Sweden},
}
Deep learning has been intensively researched in computer vision tasks like image classification. Collecting and labeling images that these neural networks are trained on is labor-intensive, which is why alternative methods of collecting images are of interest. Virtual environments allow rendering images and automatic labeling, which could speed up the process of generating training data and reduce costs. This thesis studies the problem of transfer learning in image classification when the classifier has been trained on rendered images using a game engine and tested on real images. The goal is to render images using a game engine to create a classifier that can separate images depicting people wearing civilian clothing or camouflage. The thesis also studies how domain adaptation techniques using generative adversarial networks could be used to improve the performance of the classifier. Experiments show that it is possible to generate images that can be used for training a classifier capable of separating the two classes. However, the experiments with domain adaptation were unsuccessful. It is instead recommended to improve the quality of the rendered images in terms of features used in the target domain to achieve better results.
@mastersthesis{diva2:1431281,
author = {Thornström, Johan},
title = {{Domain Adaptation of Unreal Images for Image Classification}},
school = {Linköping University},
type = {{LiTH-ISY-EX--20/5282--SE}},
year = {2019},
address = {Sweden},
}
In this thesis we investigate the use of GANs for texture enhancement. To achieve this, we have studied whether synthetic satellite images generated by GANs will improve the texture in satellite-based 3D maps.
We investigate two GANs: SRGAN and pix2pix. SRGAN increases the pixel resolution of the satellite images by generating upsampled images from low-resolution images. As for pix2pix, the GAN performs image-to-image translation by translating a source image to a target image, without changing the pixel resolution.
We trained the GANs in two different approaches, named SAT-to-AER and SAT-to-AER-3D, where SAT, AER and AER-3D are different datasets provided by the company Vricon. In the first approach, aerial images were used as ground truth and in the second approach, rendered images from an aerial-based 3D map were used as ground truth.
The procedure of enhancing the texture in a satellite-based 3D map was divided in two steps: the generation of synthetic satellite images and the re-texturing of the 3D map. Synthetic satellite images generated by two SRGAN models and one pix2pix model were used for the re-texturing. The best results were obtained using SRGAN in the SAT-to-AER approach, where the re-textured 3D map had enhanced structures and an increased perceived quality. SRGAN also presented a good result in the SAT-to-AER-3D approach, where the re-textured 3D map had a changed color distribution and the road markers were easier to distinguish from the ground. The images generated by the pix2pix model presented the worst result. As for the SAT-to-AER approach, even though the synthetic satellite images generated by pix2pix were somewhat enhanced and contained less noise, they had no significant impact in the re-texturing. In the SAT-to-AER-3D approach, none of the investigated models based on the pix2pix framework presented any successful results.
We concluded that GANs can be used as a texture enhancer using both aerial images and images rendered from an aerial-based 3D map as ground truth. The use of GANs as a texture enhancer has great potential and there are several interesting areas for future work.
@mastersthesis{diva2:1375054,
author = {Birgersson, Anna and Hellgren, Klara},
title = {{Texture Enhancement in 3D Maps using Generative Adversarial Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5266--SE}},
year = {2019},
address = {Sweden},
}
The organization International Aid Services (IAS) provides people in East Africa with clean water through well drilling. The wells are located in surroundings far away from the investors, and IAS therefore wishes to monitor the wells to get a better overview of whether different types of improvements need to be made. Of particular interest is to see the load on different water sources at different times of the day and during the year, and to know how many people are visiting the wells. In this paper, a method is proposed for counting people around the wells. The goal is to choose a suitable method for detecting humans in images and evaluate how it performs. Counting humans in images is not a new topic, but the situation at hand implies some restrictions. A Raspberry Pi with an associated camera is used, which is a small embedded system that cannot handle large and complex software. There is also a limited amount of data in the project. The method proposed in this project uses a pre-trained convolutional neural network based object detector called the Single Shot Detector, which is adapted to suit smaller devices and applications. The pre-trained network that it is based on is called MobileNet, a network developed to be used on smaller systems. To see how well the chosen detector performs, it is compared with some other models, among them a detector based on the Inception network, a significantly larger network than MobileNet. The base network is modified by transfer learning. Results show that a fine-tuned and modified network can achieve better results, from an F1-score of 0.49 for a non-fine-tuned model to 0.66 for the fine-tuned one.
@mastersthesis{diva2:1352472,
author = {Kastberg, Maria},
title = {{Using Convolutional Neural Networks to Detect People Around Wells in South Sudan}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5200--SE}},
year = {2019},
address = {Sweden},
}
For a long time stereo-cameras have been deployed in visual Simultaneous Localization And Mapping (SLAM) systems to gain 3D information. Even though stereo-cameras show good performance, the main disadvantage is the complex and expensive hardware setup it requires, which limits the use of the system. A simpler and cheaper alternative are monocular cameras, however monocular images lack the important depth information. Recent works have shown that having access to depth maps in monocular SLAM system is beneficial since they can be used to improve the 3D reconstruction. This work proposes a deep neural network that predicts dense high-resolution depth maps from monocular RGB images by casting the problem as a supervised regression task. The network architecture follows an encoder-decoder structure in which multi-scale information is captured and skip-connections are used to recover details. The network is trained and evaluated on the KITTI dataset achieving results comparable to state-of-the-art methods. With further development, this network shows good potential to be incorporated in a monocular SLAM system to improve the 3D reconstruction.
@mastersthesis{diva2:1347284,
author = {Larsson, Susanna},
title = {{Monocular Depth Estimation Using Deep Convolutional Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5234--SE}},
year = {2019},
address = {Sweden},
}
Given satellite images with accompanying pixel classifications and elevation data, we propose different solutions to object detection. The first method uses hierarchical clustering for segmentation and then employs different methods of classification. One of these classification methods used domain knowledge to classify objects while the other used Support Vector Machines. Additionally, a combination of three Support Vector Machines was used in a hierarchical structure, which outperformed the regular Support Vector Machine method in most of the evaluation metrics. The second approach is more conventional, with different types of Convolutional Neural Networks. A segmentation network was used as well as a few detection networks and different fusions between these. The Convolutional Neural Network approach proved to be the better of the two in terms of precision and recall, but the clustering approach was not far behind. This work was done using a relatively small amount of data, which potentially could have impacted the results of the Machine Learning models in a negative way.
@mastersthesis{diva2:1346426,
author = {Grahn, Fredrik and Nilsson, Kristian},
title = {{Object Detection in Domain Specific Stereo-Analysed Satellite Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5254--SE}},
year = {2019},
address = {Sweden},
}
Watermarking is a technique used to mark the ownership of media such as audio or images by embedding a watermark, e.g. copyright information, into the media. A good watermarking method should perform this embedding without affecting the quality of the media. Recent methods for watermarking images use deep learning to embed and extract the watermark. In this thesis, we investigate watermarking in the hearable frequencies of audio using deep learning. More specifically, we try to create a watermarking method for audio that is robust to noise in the carrier, and that allows for the extraction of the embedded watermark from the audio after being played over-the-air. The proposed method consists of two deep convolutional neural networks trained end-to-end on music with simulated noise. Experiments show that the proposed method successfully creates watermarks robust to simulated noise with moderate quality reductions, but it is not robust to the real-world noise introduced after playing and recording the audio over-the-air.
@mastersthesis{diva2:1340077,
author = {Tegendal, Lukas},
title = {{Watermarking in Audio using Deep Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5246--SE}},
year = {2019},
address = {Sweden},
}
Finding disparity maps between stereo images is a well studied topic within computer vision. While both classical and machine learning approaches exist in the literature, they frequently struggle to correctly solve the disparity in regions with low texture, sharp edges or occlusions. Finding approximate solutions to these problem areas is frequently referred to as disparity refinement, and is usually carried out separately after an initial disparity map has been generated.
In the recent literature, the use of Normalized Convolution in Convolutional Neural Networks has shown remarkable results when applied to the task of stereo depth completion. This thesis investigates how well this approach performs in the case of disparity refinement. Specifically, we investigate how well such a method can improve the initial disparity maps generated by the stereo matching algorithm developed at Saab Dynamics using a rectified stereo rig.
To this end, a dataset of ground truth disparity maps was created using equipment at Saab, namely a setup for structured light and the stereo rig cameras. Because the end goal is a dataset fit for training networks, we investigate an approach that allows for efficient creation of significant quantities of dense ground truth disparities.
The method for generating ground truth disparities generates several disparity maps for every scene measured by using several stereo pairs. A densified disparity map is generated by merging the disparity maps from the neighbouring stereo pairs. This resulted in a dataset of 26 scenes and 104 dense and accurate disparity maps.
Our evaluation results show that the chosen Normalized Convolution Network based method can be adapted for disparity map refinement, but is dependent on the quality of the input disparity map.
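The core idea behind normalized convolution, on which the networks above build, can be stated in a few lines: filter the signal weighted by a per-pixel certainty (zero for missing or unreliable disparities) and renormalize by the filtered certainty. The sketch below is a zeroth-order NumPy/SciPy version of that idea; the learned, layered variant used for depth completion is considerably more elaborate.

```python
import numpy as np
from scipy.ndimage import convolve

def normalized_convolution(signal, certainty, applicability):
    """Zeroth-order normalized convolution: certainty-weighted filtering
    followed by renormalization, so that pixels with zero certainty are
    filled in from their reliable neighbours."""
    c = np.asarray(certainty, dtype=float)
    numerator = convolve(signal * c, applicability, mode="nearest")
    denominator = convolve(c, applicability, mode="nearest")
    return numerator / np.maximum(denominator, 1e-8)

# Example: fill holes in a sparse disparity map with a small box filter.
# filled = normalized_convolution(disparity, disparity > 0, np.ones((5, 5)) / 25)
```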
@mastersthesis{diva2:1333176,
author = {Cranston, Daniel and Skarfelt, Filip},
title = {{Normalized Convolution Network and Dataset Generation for Refining Stereo Disparity Maps}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5252--SE}},
year = {2019},
address = {Sweden},
}
When subtitles are burned into a video, an error can sometimes occur in the encoder that results in the same subtitle being burned into several frames, resulting in subtitles becoming frozen. This thesis provides a way to detect frozen video subtitles with the help of an implemented text detector and classifier.
Two types of classifiers, naïve classifiers and machine learning classifiers, are tested and compared on a variety of different videos to see how much a machine learning approach can improve the performance. The naïve classifiers are evaluated using ground truth data to gain an understanding of the importance of good text detection. To understand the difficulty of the problem, two different machine learning classifiers are tested, logistic regression and random forests.
The result shows that machine learning improves the performance over using naïve classifiers by improving the specificity from approximately 87.3% to 95.8% and improving the accuracy from 93.3% to 95.5%. Random forests achieve the best overall performance, but the difference compared to when using logistic regression is small enough that more computationally complex machine learning classifiers are not necessary. Using the ground truth shows that the weaker naïve classifiers would be improved by at least 4.2% accuracy, thus a better text detector is warranted. This thesis shows that machine learning is a viable option for detecting frozen video subtitles.
@mastersthesis{diva2:1331490,
author = {Sjölund, Jonathan},
title = {{Detection of Frozen Video Subtitles Using Machine Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5206--SE}},
year = {2019},
address = {Sweden},
}
One fundamental task in robotics is random bin-picking, where it is important to be able to detect an object in a bin and estimate its pose to plan the motion of a robotic arm. For this purpose, this thesis work aimed to investigate and evaluate algorithms for 6D pose estimation when the object was given by a CAD model. The scene was given by a point cloud illustrating a partial 3D view of the bin with multiple instances of the object. Two algorithms were thus implemented and evaluated. The first algorithm was an approach based on Point Pair Features, and the second was Fast Global Registration. For evaluation, four different CAD models were used to create synthetic data with ground truth annotations.
It was concluded that the Point Pair Feature approach provided a robust localization of objects and can be used for bin-picking. The algorithm appears to be able to handle different types of objects, however, with small limitations when the object has flat surfaces and weak texture or many similar details. The disadvantage with the algorithm was the execution time. Fast Global Registration, on the other hand, did not provide a robust localization of objects and is thus not a good solution for bin-picking.
@mastersthesis{diva2:1330419,
author = {Lef, Annette},
title = {{CAD-Based Pose Estimation - Algorithm Investigation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5239--SE}},
year = {2019},
address = {Sweden},
}
In recent years semantic segmentation models utilizing Convolutional Neural Networks (CNN) have seen significant success for multiple different segmentation problems. Models such as U-Net have produced promising results within the medical field for both regular 2D and volumetric imaging, rivalling some of the best classical segmentation methods.
In this thesis we examined the possibility of using a convolutional neural network-based model to perform segmentation of discrete bone fragments in CT-volumes with segmentation-hints provided by a user. We additionally examined different classical segmentation methods used in a post-processing refinement stage and their effect on the segmentation quality. We compared the performance of our model to similar approaches and provided insight into how the interactive aspect of the model affected the quality of the result.
We found that the combined approach of interactive segmentation and deep learning produced results on par with some of the best methods presented, provided there was an adequate amount of annotated training data. We additionally found that the number of segmentation hints provided to the model by the user significantly affected the quality of the result, with convergence of the result at around 8 provided hints.
@mastersthesis{diva2:1326942,
author = {Estgren, Martin},
title = {{Bone Fragment Segmentation Using Deep Interactive Object Selection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5197--SE}},
year = {2019},
address = {Sweden},
}
Semantic segmentation is a key approach to comprehensive image data analysis. It can be applied to analyze 2D images, videos, and even point clouds that contain 3D data points. On the first two problems, CNNs have achieved remarkable progress, but on point cloud segmentation the results are less satisfactory due to challenges such as limited memory resources and difficulties in 3D point annotation. One of the research studies carried out by the Computer Vision Lab at Linköping University aimed to ease the semantic segmentation of 3D point clouds. The idea is that by first projecting 3D data points to 2D space and then focusing only on the analysis of 2D images, we can reduce the overall workload of the segmentation process as well as exploit existing well-developed 2D semantic segmentation techniques. In order to improve the performance of CNNs for 2D semantic segmentation, the study used input data derived from different modalities. However, how different modalities can be optimally fused is still an open question. Based on the above-mentioned study, this thesis aims to improve the multistream framework architecture. More concretely, we investigate how different singlestream architectures impact the multistream framework with a given fusion method, and how different fusion methods contribute to the overall performance of a given multistream framework. As a result, our proposed fusion architecture outperformed all the investigated traditional fusion methods. Along with the best singlestream candidate and a few additional training techniques, our final proposed multistream framework obtained a relative gain of 7.3% mIoU compared to the baseline on the Semantic3D point cloud test set, increasing the ranking from 12th to 5th position on the benchmark leaderboard.
@mastersthesis{diva2:1327473,
author = {He, Linbo},
title = {{Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery Data:
Improving 3D Point Cloud Segmentation Using Multimodal Fusion of Projected 2D Imagery Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5190--SE}},
year = {2019},
address = {Sweden},
}
Multiple object tracking is the process of assigning unique and consistent identities to objects throughout a video sequence. A popular approach to multiple object tracking, and object tracking in general, is to use a method called tracking-by-detection. Tracking-by-detection is a two-stage procedure: an object detection algorithm first detects objects in a frame, and these objects are then associated with already tracked objects by a tracking algorithm. One of the main concerns of this thesis is to investigate how different object detection algorithms perform on surveillance video supplied by the National Forensic Centre. The thesis then goes on to explore how the stand-alone performance of the object detection algorithm correlates with the overall performance of a tracking-by-detection system. Finally, the thesis investigates how the use of visual descriptors in the tracking stage of a tracking-by-detection system affects performance.
Results presented in this thesis suggest that the capacity of the object detection algorithm is highly indicative of the overall performance of the tracking-by-detection system. Further, this thesis also shows how the use of visual descriptors in the tracking stage can reduce the number of identity switches and thereby increase performance of the whole system.
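The association step of a tracking-by-detection pipeline can be illustrated with a few lines of code: build a cost matrix between existing tracks and new detections and solve the assignment with the Hungarian algorithm. The sketch below uses IoU-only costs; in a system like the one studied here, a visual-descriptor distance would be added as an extra cost term. The box format and threshold are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate(track_boxes, detection_boxes, iou_threshold=0.3):
    """Match detections to tracks by maximizing total IoU (Hungarian
    assignment); pairs below the IoU threshold are left unmatched."""
    cost = np.array([[1.0 - iou(t, d) for d in detection_boxes]
                     for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols)
            if 1.0 - cost[r, c] >= iou_threshold]
```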
@mastersthesis{diva2:1326842,
author = {Nyström, Axel},
title = {{Evaluation of Multiple Object Tracking in Surveillance Video}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5245--SE}},
year = {2019},
address = {Sweden},
}
The interest in autonomous driving assistance, and in the end, self-driving cars, has increased vastly over the last decade. Automotive safety continues to be a priority for manufacturers, politicians and people alike. Visual-based systems aiding the drivers have lately been boosted by advances in computer vision and machine learning. In this thesis, we evaluate the concept of an end-to-end machine learning solution for detecting and classifying road lane markings, and compare it to a more classical semantic segmentation solution. The analysis is based on the frame-by-frame scenario, and shows that our proposed end-to-end system has clear advantages when it comes to detecting the existence of lanes and producing a consistent, lane-like output, especially in adverse conditions such as weak lane markings. Our proposed method allows the system to predict its own confidence, thereby allowing the system to suppress its own output when it is not deemed safe enough. The thesis finishes with proposed future work needed to achieve optimal performance and create a system ready for deployment in an active safety product.
@mastersthesis{diva2:1326388,
author = {Vigren, Malcolm and Eriksson, Linus},
title = {{End-to-End Road Lane Detection and Estimation using Deep Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5219--SE}},
year = {2019},
address = {Sweden},
}
In large scale productions of metal sheets, it is important to maintain an effective way to continuously inspect the products passing through the production line. The inspection mainly consists of detection of defects and tracking of ID numbers. This thesis investigates the possibilities to create an automatic inspection system by evaluating different machine learning algorithms for defect detection and optical character recognition (OCR) on metal sheet data. Digit recognition and defect detection are solved separately, where the former compares the object detection algorithm Faster R-CNN and the classical machine learning algorithm NCGF, and the latter is based on unsupervised learning using a convolutional autoencoder (CAE).
The advantage of the feature extraction method is that it only needs a couple of samples to be able to classify new digits, which is desirable in this case due to the lack of training data. Faster R-CNN, on the other hand, needs much more training data to solve the same problem. NCGF does however fail to classify noisy images and images of metal sheets containing an alloy, while Faster R-CNN seems to be a more promising solution with a final mean average precision of 98.59%.
The CAE approach for defect detection showed promising results. The algorithm learned to reconstruct only images without defects, resulting in reconstruction errors whenever a defect appears. The errors are initially classified using a basic thresholding approach, resulting in a 98.9% accuracy. However, this classifier requires supervised learning, which is why the clustering algorithm Gaussian mixture model (GMM) is investigated as well. The result shows that it should be possible to use GMM, but that it requires a lot of GPU resources to use it in an end-to-end solution with a CAE.
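The thresholding idea behind the autoencoder-based defect detector is simple enough to sketch: since the CAE is trained only on defect-free sheets, a large per-image reconstruction error signals an anomaly. The snippet below assumes a Keras-style model and a threshold calibrated on validation data; replacing the fixed threshold with a GMM over the errors corresponds to the extension discussed above.

```python
import numpy as np

def detect_defects(autoencoder, images, threshold):
    """Flag images whose mean squared reconstruction error exceeds a
    threshold. `autoencoder` is assumed to expose a Keras-style predict()."""
    reconstructions = autoencoder.predict(images)
    errors = np.mean((images - reconstructions) ** 2, axis=(1, 2, 3))
    return errors > threshold, errors
```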
@mastersthesis{diva2:1325083,
author = {Grönlund, Jakob and Johansson, Angelina},
title = {{Defect Detection and OCR on Steel}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5220--SE}},
year = {2019},
address = {Sweden},
}
Traffic sign recognition is an important problem for autonomous cars and driver assistance systems. With recent developments in the field of machine learning, high performance can be achieved, but typically at a large computational cost.
This thesis aims to investigate the relation between classification accuracy and computational complexity for the visual recognition problem of classifying traffic signs. In particular, the benefits of partitioning the classification problem into smaller sub-problems using prior knowledge in the form of shape or current region are investigated.
In the experiments, the convolutional neural network (CNN) architecture MobileNetV2 is used, as it is specifically designed to be computationally efficient. To incorporate prior knowledge, separate CNNs are used for the different subsets generated when partitioning the dataset based on region or shape. The separate CNNs are trained from scratch or initialized by pre-training on the full dataset.
The results support the intuitive idea that performance initially increases with network size and indicate a network size where the improvement stops. Including shape information using the two investigated methods does not result in a significant improvement. Including region information using pretrained separate classifiers results in a small improvement for small complexities, for one of the regions in the experiments.
In the end, none of the investigated methods of including prior knowledge are considered to yield an improvement large enough to justify the added implementational complexity. However, some other methods are suggested, which would be interesting to study in future work.
@mastersthesis{diva2:1324051,
author = {Ekman, Carl},
title = {{Traffic Sign Classification Using Computationally Efficient Convolutional Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5216--SE}},
year = {2019},
address = {Sweden},
}
This report is the result of a master thesis made by two students at Linköping University. The aim was to find an image registration method for visual and infrared images and to find an error measure for grading the registration performance. In practice this could be used for position determination by registering the infrared image taken at the current position to a set of visual images with known positions and determining which visual image matches the best. Two methods were tried, using different image feature extractors and different ways to match the features. The first method used phase information in the images to generate soft features and then minimised the square error of the optical flow equation to estimate the transformation between the visual and infrared image. The second method used the Canny edge detector to extract hard features from the images and Chamfer distance as an error measure. Both methods were evaluated for registration as well as position determination and yielded promising results. However, the performance of both methods was image dependent. The soft edge method proved to be more robust and precise and worked better than the hard edge method for both registration and position determination.
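The Chamfer-distance error measure used in the hard-edge method has a compact formulation: take the distance transform of the reference edge map and average it over the edge pixels of the moving image. A minimal SciPy sketch of such a measure, assuming boolean Canny edge maps, is shown below.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_error(edges_reference, edges_moving):
    """Mean distance from each edge pixel in the moving image to the nearest
    edge pixel in the reference image; both inputs are boolean edge maps."""
    distance_to_reference = distance_transform_edt(~edges_reference)
    return distance_to_reference[edges_moving].mean()
```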
@mastersthesis{diva2:1323680,
author = {Fridman, Linnea and Nordberg, Victoria},
title = {{Two Multimodal Image Registration Approaches for Positioning Purposes}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5208--SE}},
year = {2019},
address = {Sweden},
}
Recently, the deep neural network structure CapsNet was proposed by Sabour et al. [11]. Capsule networks are designed to learn the relative geometry between the features of a layer and the features of the next layer. The capsule network's main building blocks are capsules, which are represented by vectors. The idea is that each capsule will represent a feature as well as traits or subfeatures of that feature. This allows for smart information routing: capsule traits are used to predict the traits of the capsules in the next layer, and information is sent to the next-layer capsules on which the predictions agree. This is called routing by agreement. This thesis investigates theoretical support for new and existing routing algorithms as well as evaluates their performance on the MNIST [16] and CIFAR-10 [8] datasets. A variation of the dynamic routing algorithm presented in the original paper [11] achieved the highest accuracy and fastest execution time.
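For orientation, the dynamic routing-by-agreement procedure from the original paper can be sketched in a few lines of PyTorch; the tensor layout (batch, lower capsules, higher capsules, capsule dimension) and the number of routing iterations are assumptions, and the thesis compares several variations of this loop.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing non-linearity: keeps direction, bounds length below 1."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

def dynamic_routing(u_hat, n_iterations=3):
    """Routing by agreement. `u_hat` holds the prediction of every lower
    capsule for every higher capsule, shape (batch, n_lower, n_higher, dim)."""
    b = torch.zeros(u_hat.shape[:-1], device=u_hat.device)   # routing logits
    for _ in range(n_iterations):
        c = torch.softmax(b, dim=2)                           # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))      # higher-layer capsules
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement update
    return v
```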
@mastersthesis{diva2:1314210,
author = {Malmgren, Christoffer},
title = {{A Comparative Study of Routing Methods in Capsule Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5188--SE}},
year = {2019},
address = {Sweden},
}
Visual object detection is a popular computer vision task that has been intensively investigated using deep learning on real data. However, data from virtual environments have not received the same attention. A virtual environment enables generating data for locations that are not easily reachable for data collection, e.g. aerial environments. In this thesis, we study the problem of object detection in virtual environments, more specifically an aerial virtual environment. We use a simulator to generate a synthetic data set of 16 different types of vehicles captured from an airplane.
To study the performance of existing methods in virtual environments, we train and evaluate two state-of-the-art detectors on the generated data set. Experiments show that both detectors, You Only Look Once version 3 (YOLOv3) and Single Shot MultiBox Detector (SSD), reach similar performance quality as previously presented in the literature on real data sets.
In addition, we investigate different fusion techniques between detectors which were trained on two different subsets of the data set, in this case a subset in which cars have fixed colors and a subset in which cars have varying colors. Experiments show that it is possible to train multiple instances of the detector on different subsets of the data set, and to combine these detectors in order to boost the performance.
@mastersthesis{diva2:1307568,
author = {Norrstig, Andreas},
title = {{Visual Object Detection using Convolutional Neural Networks in a Virtual Environment}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5195--SE}},
year = {2019},
address = {Sweden},
}
Visual object tracking is one of the fundamental problems in computer vision, with a wide range of practical applications in e.g. robotics and surveillance. Given a video sequence and the target bounding box in the first frame, a tracker is required to find the target in all subsequent frames. It is a challenging problem due to the limited training data available. An object tracker is generally evaluated using two criteria, namely robustness and accuracy. Robustness refers to the ability of a tracker to track for long durations without losing the target. Accuracy, on the other hand, denotes how accurately a tracker can estimate the target bounding box.
Recent years have seen significant improvement in tracking robustness. However, the problem of accurate tracking has seen less attention. Most current state-of-the-art trackers resort to a naive multi-scale search strategy which has fundamental limitations. Thus, in this thesis, we aim to develop a general target estimation component which can be used to determine accurate bounding box for tracking. We will investigate how bounding box estimators used in object detection can be modified to be used for object tracking. The key difference between detection and tracking is that in object detection, the classes to which the objects belong are known. However, in tracking, no prior information is available about the tracked object, other than a single image provided in the first frame. We will thus investigate different architectures to utilize the first frame information to provide target specific bounding box predictions. We will also investigate how the bounding box predictors can be integrated into a state-of-the-art tracking method to obtain robust as well as accurate tracking.
@mastersthesis{diva2:1291564,
author = {Bhat, Goutam},
title = {{Accurate Tracking by Overlap Maximization}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5189--SE}},
year = {2019},
address = {Sweden},
}
Visual Simultaneous Localization And Mapping (SLAM) allows for three-dimensional reconstruction from a camera’s output and simultaneous positioning of the camera within the reconstruction. With use cases ranging from autonomous vehicles to augmented reality, the SLAM field has garnered interest both commercially and academically.
A SLAM system performs odometry as it estimates the camera’s movement through the scene. The incremental estimation of odometry is not error free and exhibits drift over time, with map inconsistencies as a result. Detecting the return to a previously seen place, a loop, means that this new information regarding our position can be incorporated to correct the trajectory retroactively. Loop detection can also facilitate relocalization if the system loses tracking due to e.g. heavy motion blur.
This thesis proposes an odometric system making use of bundle adjustment within a keyframe-based stereo SLAM application. This system is capable of detecting loops by utilizing the algorithm FAB-MAP. Two aspects of this system are evaluated, the odometry and the capability to relocate. Both of these are evaluated using the EuRoC MAV dataset, with an absolute trajectory RMS error ranging from 0.80 m to 1.70 m for the machine hall sequences.
The capability to relocate is evaluated using a novel methodology that can be interpreted intuitively. Results are given for different levels of strictness to encompass different use cases. The method makes use of reprojection of points seen in keyframes to define whether a relocalization is possible or not. The system shows a capability to relocate in up to 85% of all cases when a keyframe exists that can project 90% of its points into the current view. Errors in estimated poses were found to be correlated with the relative distance, with errors less than 10 cm in 23% to 73% of all cases.
The evaluation of the whole system is augmented with an evaluation of local image descriptors and pose estimation algorithms. The descriptor SIFT was found to perform best overall, but is demanding to compute. BRISK was deemed the best alternative for a fast yet accurate descriptor.
A conclusion that can be drawn from this thesis is that FAB-MAP works well for detecting loops as long as the addition of keyframes is handled appropriately.
@mastersthesis{diva2:1287320,
author = {Ringdahl, Viktor},
title = {{Stereo Camera Pose Estimation to Enable Loop Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--19/5186--SE}},
year = {2019},
address = {Sweden},
}
This thesis presents and evaluates different methods to semantically segment 3D-models by rendered 2D-views. The 2D-views are segmented separately and then merged together. The thesis evaluates three different merge strategies, two different classification architectures, how many views should be rendered and how these rendered views should be arranged. The results are evaluated both quantitatively and qualitatively and then compared with the current classifier at Vricon presented in [30].
The conclusion of this thesis is that there is a performance gain to be had using this method. The best model uses two views and attains an accuracy of 90.89%, which can be compared with the 84.52% achieved by the single-view network from [30]. The best nine-view system achieved 87.72%. The difference in accuracy between the two- and nine-view systems is attributed to the higher quality mesh on the sunny side of objects, which typically is the south side.
The thesis provides a proof of concept and there are still many areas where the system can be improved. One of them is the extraction of training data, which seemingly would have a large impact on the performance.
@mastersthesis{diva2:1278684,
author = {Tranell, Victor},
title = {{Semantic Segmentation of Oblique Views in a 3D-Environment}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5185--SE}},
year = {2019},
address = {Sweden},
}
Visual tracking concerns the problem of following an arbitrary object in a video sequence. In this thesis, we examine how to use stereo images to extend existing visual tracking algorithms, which methods exist to obtain information from stereo images, and how the results change as the parameters of each tracker vary. For this purpose, four abstract approaches are identified, with five distinct implementations. Each tracker implementation is an extension of a baseline algorithm, MOSSE. The free parameters of each model are optimized with respect to two different evaluation strategies, called nor- and wir-tests, and four different objective functions, which are then fixed when comparing the models against each other. The results are created on single-target tracks extracted from the KITTI tracking dataset, and the optimization results show that none of the objective functions are sensitive to the exposed parameters under the joint selection of model and dataset. The evaluation results also show that none of the extensions improve the results of the baseline tracker.
@mastersthesis{diva2:1277154,
author = {Dehlin, Carl},
title = {{Visual Tracking Using Stereo Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5181--SE}},
year = {2019},
address = {Sweden},
}
Visual tracking is a computer vision problem where the task is to follow a target through a video sequence. Tracking has many important real-world applications in several fields such as autonomous vehicles and robot vision. Since visual tracking does not assume any prior knowledge about the target, it faces different challenges such as occlusion, appearance change, background clutter and scale change. In this thesis we try to improve the capabilities of tracking frameworks using discriminative correlation filters by incorporating scene depth information. We utilize scene depth information on three main levels. First, we use raw depth information to segment the target from its surroundings, enabling occlusion detection and scale estimation. Second, we investigate different visual features calculated from depth data to decide which features are good at encoding geometric information available solely in depth data. Third, we investigate handling missing data in the depth maps using a modified version of the normalized convolution framework. Finally, we introduce a novel approach for parameter search using genetic algorithms to find the best hyperparameters for our tracking framework. Experiments show that depth data can be used to estimate scale changes and handle occlusions. In addition, visual features calculated from depth are more representative if combined with color features. It is also shown that utilizing normalized convolution improves the overall performance in some cases. Lastly, the usage of genetic algorithms for hyperparameter search leads to accuracy gains as well as some insights on the performance of different components within the framework.
@mastersthesis{diva2:1266346,
author = {Stynsberg, John},
title = {{Incorporating Scene Depth in Discriminative Correlation Filters for Visual Tracking}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5178--SE}},
year = {2018},
address = {Sweden},
}
Training data is the bottleneck when training Convolutional Neural Networks. A larger dataset gives better accuracy but also requires longer training time. It is shown that fine-tuning neural networks on synthetic rendered images increases the mean average precision. This method was applied to two different datasets with five distinctive objects in each. The first dataset consisted of random objects with different geometric shapes. The second dataset contained objects used to assemble IKEA furniture. The neural network with the best performance, trained on 5400 images, achieved a mean average precision of 0.81 on a test set sampled from a video sequence. The impact of dataset size, batch size, number of training epochs and different network architectures was analysed. Using synthetic images to train CNNs is a promising path for object detection when access to large amounts of annotated image data is hard to come by.
@mastersthesis{diva2:1267446,
author = {Vi, Margareta},
title = {{Object Detection Using Convolutional Neural Network Trained on Synthetic Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5180--SE}},
year = {2018},
address = {Sweden},
}
In recent years, development of Convolutional Neural Networks has enabled high performing semantic segmentation models. Generally, these deep learning based segmentation methods require a large amount of annotated data. Acquiring such annotated data for semantic segmentation is a tedious and expensive task.
Within machine learning, active learning involves the selection of new data in order to limit the usage of annotated data. In active learning, the model is trained for several iterations, and additional samples that the model is uncertain of are selected. The model is then retrained on the additional samples and the process is repeated. In this thesis, an active learning framework has been applied to road segmentation, which is semantic segmentation of objects related to road scenes.
The uncertainty in the samples is estimated with Monte Carlo dropout. In Monte Carlo dropout, several dropout masks are applied to the model and the variance is captured, working as an estimate of the model’s uncertainty. Other metrics to rank the uncertainty evaluated in this work are: a baseline method that selects samples randomly, the entropy in the default predictions and three additional variations/extensions of Monte Carlo dropout.
Both the active learning framework and uncertainty estimation are implemented in the thesis. Monte Carlo dropout performs slightly better than the baseline in 3 out of 4 metrics. Entropy outperforms all other implemented methods in all metrics. The three additional methods do not perform better than Monte Carlo dropout.
An analysis of what kind of uncertainty Monte Carlo dropout captures is performed, together with a comparison of the samples selected by the baseline and by Monte Carlo dropout. Future development and possible improvements are also discussed.
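As a reference for the uncertainty estimation step, the PyTorch sketch below computes a Monte Carlo dropout uncertainty for a segmentation model: dropout is kept active at test time, several stochastic forward passes are made, and the variance of the softmax outputs is used to rank unlabeled samples. Enabling train mode for the whole model is a simplification; in practice only the dropout layers would be made stochastic.

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Per-pixel uncertainty from Monte Carlo dropout: variance of the class
    probabilities over several stochastic forward passes, averaged over classes."""
    model.train()   # keeps dropout active (simplification, see note above)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_samples)])
    return probs.var(dim=0).mean(dim=1)
```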
@mastersthesis{diva2:1259079,
author = {Sörsäter, Michael},
title = {{Active Learning for Road Segmentation using Convolutional Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5176--SE}},
year = {2018},
address = {Sweden},
}
This master thesis explores the possibility of using Generative Adversarial Networks (GANs) to refine labeled synthetic code images to resemble real code images while preserving label information. The GAN used in this thesis consists of a refiner and a discriminator. The discriminator tries to distinguish between real images and refined synthetic images. The refiner tries to fool the discriminator by producing refined synthetic images such that the discriminator classifies them as real. By updating these two networks iteratively, the idea is that they will push each other to get better, resulting in refined synthetic images with real image characteristics.
The aspiration, if the exploration of GANs turns out successful, is to be able to use refined synthetic images as training data in Semantic Segmentation (SS) tasks and thereby eliminate the laborious task of gathering and labeling real data. Starting off from a foundational GAN-model, different network architectures, hyperparameters and other design choices are explored to find the best performing GAN-model.
As is widely acknowledged in the relevant literature, GANs can be difficult to train and the results in this thesis are varying and sometimes ambiguous. Based on the results from this study, the best performing models do however perform better in SS tasks than the unrefined synthetic set they are based on and benchmarked against, with regards to Intersection over Union.
@mastersthesis{diva2:1254973,
author = {Stenhagen, Petter},
title = {{Improving Realism in Synthetic Barcode Images using Generative Adversarial Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5169--SE}},
year = {2018},
address = {Sweden},
}
Thermal spectrum cameras are gaining interest in many applications due to their long wavelength which allows them to operate under low light and harsh weather conditions. One disadvantage of thermal cameras is their limited visual interpretability for humans, which limits the scope of their applications. In this thesis, we try to address this problem by investigating the possibility of transforming thermal infrared (TIR) images to perceptually realistic visible spectrum (VIS) images by using Convolutional Neural Networks (CNNs). Existing state-of-the-art colorization CNNs fail to provide the desired output as they were trained to map grayscale VIS images to color VIS images. Instead, we utilize an auto-encoder architecture to perform cross-spectral transformation between TIR and VIS images. This architecture was shown to quantitatively perform very well on the problem while producing perceptually realistic images. We show that the quantitative differences are insignificant when training this architecture using different color spaces, while there exist clear qualitative differences depending on the choice of color space. Finally, we found that a CNN trained from daytime examples generalizes well on tests from night time.
@mastersthesis{diva2:1255342,
author = {Nyberg, Adam},
title = {{Transforming Thermal Images to Visible Spectrum Images Using Deep Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5167--SE}},
year = {2018},
address = {Sweden},
}
Recently, sensors such as radars and cameras have been widely used in automotives, especially in Advanced Driver-Assistance Systems (ADAS), to collect information about the vehicle's surroundings. Stereo cameras are very popular as they can be used passively to construct a 3D representation of the scene in front of the car. This has allowed the development of several ADAS algorithms that need 3D information to perform their tasks. One interesting application is Road Surface Preview (RSP), where the task is to estimate the road height along the future path of the vehicle. An active suspension control unit can then use this information to regulate the suspension, improving driving comfort, extending the durability of the vehicle and warning the driver about potential risks on the road surface. Stereo cameras have been successfully used in RSP and have demonstrated very good performance. However, the main disadvantages of stereo cameras are their high production cost and high power consumption. This limits the installation of several ADAS features in economy-class vehicles. A less expensive alternative are monocular cameras, which have a significantly lower cost and power consumption. Therefore, this thesis investigates the possibility of solving the Road Surface Preview task using a monocular camera. We try two different approaches: structure-from-motion and Convolutional Neural Networks. The proposed methods are evaluated against the stereo-based system. Experiments show that both structure-from-motion and CNNs have good potential for solving the problem, but they are not yet reliable enough to be a complete solution to the RSP task and be used in an active suspension control unit.
@mastersthesis{diva2:1253882,
author = {Ekström, Marcus},
title = {{Road Surface Preview Estimation Using a Monocular Camera}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5173--SE}},
year = {2018},
address = {Sweden},
}
The purpose of the thesis was to investigate the possibility of using machine learning for automation of liver fat measurements in fat-water magnetic resonance imaging (MRI). The thesis presents methods for texture-based liver classification and Proton Density Fat Fraction (PDFF) regression using multi-layer perceptrons utilizing 2D and 3D textural image features. The first proposed method was a data classification method with the goal to distinguish between suitable and unsuitable regions to measure PDFF in. The second proposed method was a combined classification and regression method where the classification distinguishes between liver and non-liver tissue. The goal of the regression model was to predict the difference d = PDFF_mean − PDFF_ROI between the manual ground truth mean and the fat fraction of the active Region of Interest (ROI). Tests were performed on varying sizes of Image Feature Regions (froi) and combinations of image features on both of the proposed methods. The tests showed that 3D measurements using image features from discrete wavelet transforms produced measurements similar to the manual fat measurements. The first method resulted in lower relative errors while the second method had a higher method agreement compared to manual measurements.
@mastersthesis{diva2:1248500,
author = {Grundström, Tobias},
title = {{Automated Measurements of Liver Fat Using Machine Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5166--SE}},
year = {2018},
address = {Sweden},
}
Robotic bin picking is the problem of emptying a bin of randomly distributed objects through a robotic interface. This thesis examines an SVM approach to extract grasping points for a vacuum-type gripper. The SVM is trained on synthetic data and used to classify the points of a non-synthetic 3D-scanned point cloud as either graspable or non-graspable. The classified points are then clustered into graspable regions from which the grasping points are extracted.
The SVM models and the algorithm as a whole are trained and evaluated against cubic and cylindrical objects. Separate SVM models are trained for each type of object in addition to one model being trained on a dataset containing both types of objects. It is shown that the performance of the SVM in terms of accuracy is dependent on the objects and their geometrical properties. Further, it is shown that the algorithm is reasonably robust in terms of successfully picking objects, regardless of the scale of the objects.
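As a rough, hypothetical sketch of this kind of per-point classification step (not the thesis implementation), an SVM can be trained on synthetic per-point features and applied to a scanned cloud; the feature names below are assumptions made for illustration:

# Hypothetical sketch: per-point SVM classification of graspable points
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic training data: one row per point, e.g. [normal_z, local_flatness, curvature]
X_train = np.random.rand(1000, 3)
y_train = (X_train[:, 0] > 0.8).astype(int)   # toy rule: near-vertical normals are graspable

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

# Classify the points of a (here random) scanned cloud and keep the graspable ones
X_scan = np.random.rand(5000, 3)
graspable_mask = clf.predict(X_scan).astype(bool)
graspable_points = X_scan[graspable_mask]
print(f"{graspable_mask.sum()} of {len(X_scan)} points classified as graspable")

In practice the graspable points would then be grouped into regions, for instance with a clustering algorithm, before grasp points are extracted.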
@mastersthesis{diva2:1243310,
author = {Olsson, Fredrik},
title = {{Feature Based Learning for Point Cloud Labeling and Grasp Point Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5165--SE}},
year = {2018},
address = {Sweden},
}
Data about the earth is increasing in value and demand from customers, but it is difficult to produce accurately and cheaply. This thesis examines whether it is possible to take low-resolution and distorted 3D data and increase the accuracy of building geometry by performing building reconstruction. Building reconstruction is performed with a Markov chain Monte Carlo method where building primitives are placed iteratively until a good fit is found. The digital height model and pixel classification used are produced by Vricon. The method is able to correctly place primitive models, but often overestimates their dimensions by about 15%.
@mastersthesis{diva2:1223969,
author = {Nilsson, Mats},
title = {{Building Reconstruction of Digital Height Models with the Markov Chain Monte Carlo Method}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5130--SE}},
year = {2018},
address = {Sweden},
}
When a Time-of-Flight (ToF) depth camera is used to monitor a region of interest, it has to be mounted correctly and have information regarding its position. Manual configuration currently requires managing captured 3D ToF data in a 2D environment, which limits the user and might give rise to errors due to misinterpretation of the data. This thesis investigates if a real-time 3D reconstruction mesh from a Microsoft HoloLens can be used as a target for point cloud registration using the ToF data, thus configuring the camera autonomously. Three registration algorithms, Fast Global Registration (FGR), Joint Registration of Multiple Point Clouds (JR-MPC) and Prerejective RANSAC, were evaluated for this purpose.
It was concluded that despite using different sensors, it is possible to perform accurate registration. It was also shown that the registration can be done accurately within a reasonable time, compared with the inherent time to perform 3D reconstruction on the HoloLens. All algorithms could solve the problem, but it was concluded that FGR provided the most satisfying results, though requiring several constraints on the data.
@mastersthesis{diva2:1222450,
author = {Kjell\'{e}n, Kevin},
title = {{Point Cloud Registration in Augmented Reality using the Microsoft HoloLens}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5160--SE}},
year = {2018},
address = {Sweden},
}
Volume measurements of timber loads are performed in conjunction with timber trade. When dealing with goods of major economic value such as these, it is important to achieve an impartial and fair assessment when determining the volumes on which prices are based.
With the help of Saab’s missile targeting technology, CIND AB develops products for digital volume measurement of timber loads. Currently there is a system in operation that automatically reconstructs timber trucks in motion to create measurable images of them. Future iterations of the system are expected to fully automate the scaling by generating a volumetric representation of the timber and calculating its external gross volume. The first challenge towards this development is to separate the timber load from the truck.
This thesis aims to evaluate and implement an appropriate method for semantic pixel-wise segmentation of timber loads in real time. Image segmentation is a classic but difficult problem in computer vision. To achieve greater robustness, it is therefore important to carefully study and make use of the conditions given by the existing system. Variations in timber type, truck type and packing together create unique combinations that the system must be able to handle. The system must work around the clock in different weather conditions while maintaining high precision and performance.
@mastersthesis{diva2:1222024,
author = {Sällqvist, Jessica},
title = {{Real-time 3D Semantic Segmentation of Timber Loads with Convolutional Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5131--SE}},
year = {2018},
address = {Sweden},
}
The cost and environmental damage of reclaims is a large problem within the paper industry. With certain types of paper, so-called crepe marks on the paper’s surface are a common issue, leading to printing defects and consequently reclaims. This thesis compares four different image analysis methods for evaluating crepe marks and predicting printing results. The methods evaluated consist of one established method, two adaptations of established methods and one novel method. All methods were evaluated on the same data: topographic height images of paper samples from four paper rolls of similar type but differing in roughness. The method based on 1D Fourier analysis and the method based on fully convolutional networks perform best, depending on whether speed or detailed characteristics is the priority.
@mastersthesis{diva2:1219118,
author = {Strömberg, Isak},
title = {{Characterization of creping marks in paper}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5151--SE}},
year = {2018},
address = {Sweden},
}
In this thesis we study a perception problem in the context of autonomous driving. Specifically, we study the computer vision problem of 3D object detection, in which objects should be detected from various sensor data and their position in the 3D world should be estimated. We also study the application of Generative Adversarial Networks in domain adaptation techniques, aiming to improve the 3D object detection model's ability to transfer between different domains.
The state-of-the-art Frustum-PointNet architecture for LiDAR-based 3D object detection was implemented and found to closely match its reported performance when trained and evaluated on the KITTI dataset. The architecture was also found to transfer reasonably well from the synthetic SYN dataset to KITTI, and is thus believed to be usable in a semi-automatic 3D bounding box annotation process. The Frustum-PointNet architecture was also extended to explicitly utilize image features, which surprisingly degraded its detection performance. Furthermore, an image-only 3D object detection model was designed and implemented, which was found to compare quite favourably with current state-of-the-art in terms of detection performance.
Additionally, the PixelDA approach was adopted and successfully applied to the MNIST to MNIST-M domain adaptation problem, which validated the idea that unsupervised domain adaptation using Generative Adversarial Networks can improve the performance of a task network for a dataset lacking ground truth annotations. Surprisingly, the approach did however not significantly improve upon the performance of the image-based 3D object detection models when trained on the SYN dataset and evaluated on KITTI.
@mastersthesis{diva2:1218149,
author = {Gustafsson, Fredrik and Linder-Nor\'{e}n, Erik},
title = {{Automotive 3D Object Detection Without Target Domain Annotations}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5138--SE}},
year = {2018},
address = {Sweden},
}
3D reconstruction is the process of constructing a three-dimensional model from images. It contains multiple steps where each step can induce errors. When doing 3D reconstruction of outdoor scenes, there are some types of scene content that regularly cause problems and affect the resulting 3D model. Two of these are water, due to its fluctuating nature, and sky, because it contains no useful (3D) data. These areas cause different problems throughout the process and generally do not benefit it in any way. Therefore, masking them early in the reconstruction chain could be a useful step in an outdoor scene reconstruction pipeline. Manual masking of images is a time-consuming and tedious task, especially for the large data sets that are often used in large-scale 3D reconstructions. This master thesis explores if this can be done automatically using Convolutional Neural Networks for semantic segmentation, and to what degree the masking would benefit a 3D reconstruction pipeline.
@mastersthesis{diva2:1216761,
author = {Kernell, Björn},
title = {{Improving Photogrammetry using Semantic Segmentation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5118--SE}},
year = {2018},
address = {Sweden},
}
The aim of this work is to find a method for removing haze from satellite imagery. This is done by taking two algorithms developed for images taken from the surface of the earth and adapting them for satellite images. The two algorithms are Single Image Haze Removal Using Dark Channel Prior by He et al. and Color Image Dehazing Using the Near-Infrared by Schaul et al. Both algorithms, altered to fit satellite images, plus the combination are applied on four sets of satellite images. The results are compared with each other and the unaltered images. The evaluation is both qualitative, i.e. looking at the images, and quantitative using three properties: colorfulness, contrast and saturated pixels. Both the qualitative and the quantitative evaluation determined that using only the altered version of Dark Channel Prior gives the result with the least amount of haze and whose colors look most like reality.
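For reference, a minimal sketch of the dark channel prior idea (He et al.) is given below; the patch size and the omega/t0 constants follow the original paper's defaults, and this is not the adapted satellite variant developed in the thesis:

# Minimal dark channel prior sketch, assuming a float RGB image in [0, 1]
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    # Per-pixel minimum over color channels, then a local minimum filter
    return minimum_filter(img.min(axis=2), size=patch)

def dehaze(img, patch=15, omega=0.95, t0=0.1):
    dark = dark_channel(img, patch)
    # Atmospheric light: mean color of the brightest 0.1% dark-channel pixels
    n = max(1, int(0.001 * dark.size))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = img[idx].mean(axis=0)
    # Transmission estimate and scene radiance recovery
    t = 1.0 - omega * dark_channel(img / A, patch)
    t = np.clip(t, t0, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)

dehazed = dehaze(np.random.rand(128, 128, 3))   # stand-in image; a real photo would be loaded instead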
@mastersthesis{diva2:1215181,
author = {Hultberg, Johanna},
title = {{Dehazing of Satellite Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5121--SE}},
year = {2018},
address = {Sweden},
}
Deep learning has been growing rapidly in recent years, obtaining excellent results for many computer vision applications, such as image classification and object detection. One reason for the increased popularity of deep learning is that it mitigates the need for hand-crafted features. This thesis work investigates deep learning as a methodology to solve the problem of autonomous collision avoidance for a small robotic car. To accomplish this, transfer learning is used with the VGG16 deep network pre-trained on the ImageNet dataset. A dataset has been collected and then used to fine-tune and validate the network offline. The deep network has been used with the robotic car in real time: the robotic car sends images to an external computer, which runs the network, and the predictions from the network are sent back to the robotic car, which takes actions based on those predictions. The results show that deep learning has great potential in solving the collision avoidance problem.
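A minimal transfer learning sketch of this kind, written here with PyTorch/torchvision (recent versions) purely for illustration, is shown below; the two-class output and the training details are assumptions and differ from the thesis pipeline:

# Illustrative sketch: fine-tune ImageNet pre-trained VGG16 for a binary decision
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False            # keep the convolutional features fixed

model.classifier[6] = nn.Linear(4096, 2)   # replace the final ImageNet layer

optimizer = torch.optim.SGD(model.classifier[6].parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One toy training step on a random batch; real data would come from a DataLoader
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()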
@mastersthesis{diva2:1204063,
author = {Strömgren, Oliver},
title = {{Deep Learning for Autonomous Collision Avoidance}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5115--SE}},
year = {2018},
address = {Sweden},
}
Industrial applications of computer vision often utilize traditional image processing techniques whereas state-of-the-art methods in most image processing challenges are almost exclusively based on convolutional neural networks (CNNs). Thus there is a large potential for improving the performance of many machine vision applications by incorporating CNNs.
One such application is the classification of juice boxes with straws, where the baseline solution uses classical image processing techniques on depth images to reject or accept juice boxes. This thesis aims to investigate how CNNs perform on the task of semantic segmentation (pixel-wise classification) of said images and if the result can be used to increase classification performance.
A drawback of CNNs is that they usually require large amounts of labelled data for training to be able to generalize and learn anything useful. As labelled data is hard to come by, two ways to get cheap data are investigated, one being synthetic data generation and the other being automatic labelling using the baseline solution.
The implemented network performs well on semantic segmentation, even when trained on synthetic data only, though the performance increases with the ratio of real (automatically labelled) to synthetic images. The classification task is very sensitive to small errors in semantic segmentation and the results are therefore not as good as the baseline solution. It is suspected that the drop in performance between validation and test data is due to a domain shift between the data sets, e.g. variations in data collection and straw and box type, and fine-tuning to the target domain could definitely increase performance.
When trained on synthetic data the domain shift is even larger and the performance on classification is next to useless. It is likely that the results could be improved by using more advanced data generation, e.g. a generative adversarial network (GAN), or more rigorous modelling of the data.
@mastersthesis{diva2:1189501,
author = {Carlsson, Mattias},
title = {{Neural Networks for Semantic Segmentation in the Food Packaging Industry}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5113--SE}},
year = {2018},
address = {Sweden},
}
Photos captured in the shortwave infrared (SWIR) spectrum are interesting in military applications because they are independent of the time of day the picture is captured, since the sun, moon, stars and night glow constantly illuminate the earth with short-wave infrared radiation. A major problem with today’s SWIR cameras is that they are very expensive to produce and hence not broadly available, either within the military or to civilians. A relatively new technology called compressive sensing (CS) enables a new type of camera with only a single-pixel sensor (an SPC). This new type of camera only needs a fraction of measurements relative to the number of pixels to be reconstructed and reduces the cost of a short-wave infrared camera by a factor of 20. The camera uses a micromirror array (DMD) to select which mirrors (pixels) to be measured in the scene, thus creating an underdetermined linear equation system that can be solved using the techniques described in CS to reconstruct the image. Given the new technology, it is in the Swedish Defence Research Agency's (FOI) interest to evaluate the potential of a single-pixel camera. With an SPC architecture developed by FOI, the goal of this thesis was to develop methods for sampling, reconstructing images and evaluating their quality. This thesis shows that structured random matrices and fast transforms have to be used to enable high-resolution images and to speed up the process of reconstructing images significantly. The evaluation of the images could be done with standard measurements associated with camera evaluation and showed that the camera can reproduce high-resolution images with relatively high image quality in daylight.
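As a toy illustration of the underlying compressive sensing idea (not FOI's SPC pipeline), a sparse signal can be recovered from a small number of structured measurements; here rows of a Hadamard matrix stand in for the measurement patterns and orthogonal matching pursuit serves as the solver, with sizes chosen arbitrarily:

# Toy compressive-sensing sketch: structured measurements + sparse recovery
import numpy as np
from scipy.linalg import hadamard
from sklearn.linear_model import OrthogonalMatchingPursuit

n = 256                     # signal length (e.g. a flattened image row)
m = 64                      # number of single-pixel measurements (25% of n)

x = np.zeros(n)
x[np.random.choice(n, 8, replace=False)] = np.random.randn(8)    # sparse scene

H = hadamard(n)[np.random.choice(n, m, replace=False)] / np.sqrt(n)   # random Hadamard rows
y = H @ x                                                              # measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=8, fit_intercept=False)
omp.fit(H, y)
x_hat = omp.coef_
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))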
@mastersthesis{diva2:1185507,
author = {Brorsson, Andreas},
title = {{Compressive Sensing: Single Pixel SWIR Imaging of Natural Scenes}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5108--SE}},
year = {2018},
address = {Sweden},
}
In this report I summarize my master’s thesis work, in which I have investigated different approaches for fusing imaging modalities for semantic segmentation with deep convolutional networks. State-of-the-art methods for semantic segmentation of RGB images use pre-trained models, which are fine-tuned to learn task-specific deep features. However, the use of pre-trained model weights constrains the model input to images with three channels (e.g. RGB images). In some applications, e.g. classification of satellite imagery, there are other imaging modalities that can complement the information from the RGB modality and thus improve the performance of the classification. In this thesis, semantic segmentation methods designed for RGB images are extended to handle multiple imaging modalities without compromising the benefits that pre-training on RGB datasets offers.
In the experiments of this thesis, RGB images from satellites have been fused with the normalised difference vegetation index (NDVI) and a digital surface model (DSM). The evaluation shows that the modality fusion can significantly improve the performance of semantic segmentation networks in comparison with a corresponding network with only RGB input. However, the different investigated approaches to fuse the modalities proved to achieve similar performance. The conclusion of the experiments is that the fusion of imaging modalities is necessary, but that the method of fusion is of less importance.
@mastersthesis{diva2:1182913,
author = {Sundelius, Carl},
title = {{Deep Fusion of Imaging Modalities for Semantic Segmentation of Satellite Imagery}},
school = {Linköping University},
type = {{LiTH-ISY-EX--18/5110--SE}},
year = {2018},
address = {Sweden},
}
The thesis work evaluates a method to estimate the volume of stone and gravel piles using only a cellphone to collect video and sensor data from the gyroscopes and accelerometers. The project is commissioned by Escenda Engineering with the motivation to replace more complex and resource-demanding systems with a cheaper and easy-to-use handheld device. The implementation features popular computer vision methods such as KLT tracking, Structure-from-Motion and Space Carving, together with some sensor fusion. The results imply that it is possible to estimate volumes up to a certain accuracy, limited by the sensor quality and subject to a bias.
@mastersthesis{diva2:1172784,
author = {Fallqvist, Marcus},
title = {{Automatic Volume Estimation Using Structure-from-Motion Fused with a Cellphone's Inertial Sensors}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5107--SE}},
year = {2017},
address = {Sweden},
}
Semantic segmentation of a scene aims to give meaning to the scene by dividing it into meaningful — semantic — parts. Understanding the scene is of great interest for all kinds of autonomous systems, but manual annotation is simply too time consuming, which is why there is a need for an alternative approach. This thesis investigates the possibility of automatically segmenting 3D-models of urban scenes, such as buildings, into a predetermined set of labels. The approach was to first acquire ground truth data by manually annotating five 3D-models of different urban scenes. The next step was to extract features from the 3D-models and evaluate which ones constitutes a suitable feature space. Finally, three supervised learners were implemented and evaluated: k-Nearest Neighbour (KNN), Support Vector Machine (SVM) and Random Classification Forest (RCF). The classifications were done point-wise, classifying each 3D-point in the dense point cloud belonging to the model being classified.
The results showed that the most suitable feature space is not necessarily the one containing all features. The KNN classifier got the highest average accuracy over all models, classifying 42.5% of the 3D points correctly. The RCF classifier managed to classify 66.7% of the points correctly in one of the models, but had worse performance for the rest of the models, resulting in a lower average accuracy compared to KNN. In general, KNN, SVM and RCF seemed to have different benefits and drawbacks. KNN is simple and intuitive but by far the slowest classifier when dealing with a large set of training data. SVM and RCF are both fast but difficult to tune as there are more parameters to adjust. Whether the reason for the relatively low best accuracy was the lack of ground truth training data, unbalanced validation models, or the capacity of the learners was never investigated due to the limited time span. However, this ought to be investigated in future studies.
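A hypothetical sketch of the point-wise classification setup with scikit-learn is given below; the per-point features and class names are assumptions for illustration, not the feature space evaluated in the thesis:

# Hypothetical sketch: point-wise k-nearest-neighbour classification of a 3D point cloud
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Toy feature matrix: [height_above_ground, normal_z, r, g, b] per point
X = np.random.rand(20000, 5)
y = np.random.randint(0, 4, size=20000)   # e.g. ground / facade / roof / vegetation

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
print("point-wise accuracy:", knn.score(X_test, y_test))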
@mastersthesis{diva2:1166634,
author = {Lind, Johan},
title = {{Make it Meaningful:
Semantic Segmentation of Three-Dimensional Urban Scene Models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5103--SE}},
year = {2017},
address = {Sweden},
}
Barcodes are ubiquitous in modern society and they have had industrial applications for decades. However, for noisy images modern methods can underperform: poor lighting conditions, occlusions and low resolution can be problematic in decoding. This thesis aims to solve this problem by using neural networks, which have enjoyed great success in many computer vision competitions in recent years. We investigate how three different networks perform on data sets with noisy images. The first network is a single classifier, the second network is an ensemble classifier and the third is based on a pre-trained feature extractor. For comparison, we also test two baseline methods that are used in industry today. We generate training data using software and modify it to ensure proper generalization. Testing data is created by photographing barcodes in different settings, creating six image classes: normal, dark, white, rotated, occluded and wrinkled. The proposed single classifier and ensemble classifier outperform the baseline as well as the pre-trained feature extractor by a large margin. The thesis work was performed at SICK IVP, a machine vision company in Linköping, in 2017.
@mastersthesis{diva2:1164104,
author = {Fridborn, Fredrik},
title = {{Reading Barcodes with Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5102--SE}},
year = {2017},
address = {Sweden},
}
Being able to reconstruct real-world environments into digital 3D models is something that has many different types of interesting applications. With the current state of the art, the results can be very impressive, but there is naturally still room for improvements. This thesis looks into essentially two different parts. The first part is about finding out whether it is feasible to detect geometric primitives, mainly planes, in the initially reconstructed point cloud. The second part looks into using the information about which points have been fitted to a geometric primitive to improve the final model.
Detection of the geometric primitives is done using the RANSAC-algorithm, which is a method for discovering if a given model is present in a data set.
A few different alternatives are evaluated for using the information about the geometric primitives to improve the final surface. The first option is to project points onto their identified shape. The second option is to remove points that have not been matched to a shape. The last option is to evaluate the possibility of changing the weights of individual points, which is an alternative available in the chosen surface reconstruction method.
The detection of geometric primitives shows some potential, but it often requires manual intervention to find correct parameters for different types of data sets. As for using the information about the geometric primitives to improve the final model, both projecting points and removal of non-matched points, does not quite address the problem at hand. Increasing the weights on matched points does show some potential, however, but is still far from being a complete method.
A small part of the thesis looks into the possibility of automatically finding areas where there are significant differences between the initial point cloud and a reconstructed surface. For this, hierarchical clustering is used. This part is, however, not evaluated quantitatively.
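To illustrate the plane-detection step described above, a minimal RANSAC plane fit on an N x 3 point cloud could look as follows; the threshold, iteration count and synthetic data are illustrative assumptions, not the values used in the thesis:

# Minimal RANSAC plane detection on a point cloud
import numpy as np

def ransac_plane(points, n_iter=1000, threshold=0.01):
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[np.random.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        normal /= norm
        d = -normal @ sample[0]
        dist = np.abs(points @ normal + d)
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

pts = np.random.rand(5000, 3)
pts[:3000, 2] = 0.5 + 0.005 * np.random.randn(3000)   # synthetic planar region
mask = ransac_plane(pts)
print(f"{mask.sum()} points matched the dominant plane")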
@mastersthesis{diva2:1153573,
author = {Norlander, Robert},
title = {{Make it Complete:
Surface Reconstruction Aided by Geometric Primitives}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5096--SE}},
year = {2017},
address = {Sweden},
}
The ability to automatically estimate the volume of timber is becoming increasingly important within the timber industry. The large number of timber trucks arriving each day at Swedish timber terminals fortifies the need for a volume estimation performed in real-time and on-the-go as the trucks arrive.
This thesis investigates if a volumetric integration of disparity maps acquired from a Multi-View Stereo (MVS) system is a suitable approach for automatic volume estimation of timber loads. As real-time execution is preferred, efforts were made to provide a scalable method. The proposed method was quantitatively evaluated on datasets containing two geometric objects of known volume. A qualitative comparison to manual volume estimates of timber loads was also made on datasets recorded at a Swedish timber terminal.
The proposed method is shown to be both accurate and precise under specific circumstances. However, robustness is poor to varying weather conditions, although a more thorough evaluation of this aspect needs to be performed. The method is also parallelizable, which means that future efforts can be made to significantly decrease execution time.
@mastersthesis{diva2:1153580,
author = {Rundgren, Emil},
title = {{Automatic Volume Estimation of Timber from Multi-View Stereo 3D Reconstruction}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5093--SE}},
year = {2017},
address = {Sweden},
}
During flights with manned or unmanned aircraft, continuous recording can result in a very high number of images to analyze and evaluate. To simplify image analysis and to minimize data link usage, appropriate images should be suggested for transfer and further analysis. This thesis investigates features used for selection of images worthy of further analysis using machine learning. The selection is done based on the criteria of having good quality, salient content and being unique compared to the other selected images. The investigation is approached by implementing two binary classifications, one regarding content and one regarding quality. The classifications are made using support vector machines. For each of the classifications three feature extraction methods are performed and the results are compared against each other. The feature extraction methods used are histograms of oriented gradients, features from the discrete cosine transform domain and features extracted from a pre-trained convolutional neural network. The images classified as both good and salient are then clustered based on similarity measures retrieved using color coherence vectors. One image from each cluster is retrieved and those are the resulting images from the image selection. The performance of the selection is evaluated using the measures precision, recall and accuracy. The investigation showed that using features extracted from the discrete cosine transform provided the best results for the quality classification. For the content classification, features extracted from a convolutional neural network provided the best results. The similarity retrieval showed to be the weakest part and the entire system together provides an average accuracy of 83.99%.
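As a rough sketch of one of the compared pipelines (histograms of oriented gradients feeding a support vector machine for a binary decision), assuming grayscale frames and illustrative parameters rather than the thesis configuration:

# Illustrative HOG + SVM pipeline for a binary image-selection decision
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(images):
    return np.array([hog(img, orientations=9, pixels_per_cell=(16, 16),
                         cells_per_block=(2, 2)) for img in images])

# Toy grayscale images standing in for aerial frames, with binary labels
images = np.random.rand(40, 128, 128)
labels = np.random.randint(0, 2, size=40)

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(hog_features(images), labels)
selected = clf.predict(hog_features(np.random.rand(10, 128, 128)))
print("frames flagged as worth keeping:", selected)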
@mastersthesis{diva2:1151145,
author = {Lorentzon, Matilda},
title = {{Feature Extraction for Image Selection Using Machine Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5097--SE}},
year = {2017},
address = {Sweden},
}
The recent emergence of time-of-flight cameras has opened up new possibilities in the world of computer vision. These compact sensors, capable of recording the depth of a scene in real-time, are very advantageous in many applications, such as scene or object reconstruction. This thesis first addresses the problem of fusing depth data with color images. A complete process to combine a time-of-flight camera with a color camera is described and its accuracy is evaluated. The results show that a satisfying precision is reached and that the step of calibration is very important.
The second part of the work consists of applying super-resolution techniques to the time-of-flight camera in order to improve its low resolution. Different types of super-resolution algorithms exist, but this thesis focuses on the combination of multiple shifted depth maps. The proposed framework is made of two steps: registration and reconstruction. Different methods for each step are tested and compared according to the improvements reached in terms of level of detail, sharpness and noise reduction. The results obtained show that Lucas-Kanade performs best for the registration and that a non-uniform interpolation gives the best results in terms of reconstruction. Finally, a few suggestions are made about future work and extensions for our solutions.
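A toy sketch of the reconstruction step is given below: samples from several registered, sub-pixel-shifted depth maps are fused onto a finer grid by non-uniform interpolation (here scipy's griddata; the shifts and depth data are synthetic stand-ins, and the thesis compares several interpolation schemes):

# Toy non-uniform interpolation of shifted depth maps onto a finer grid
import numpy as np
from scipy.interpolate import griddata

low_res = 32
shifts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]   # known sub-pixel offsets

samples, values = [], []
for dx, dy in shifts:
    yy, xx = np.mgrid[0:low_res, 0:low_res].astype(float)
    depth = np.sin(0.2 * (xx + dx)) + np.cos(0.2 * (yy + dy))   # stand-in depth map
    samples.append(np.column_stack([(xx + dx).ravel(), (yy + dy).ravel()]))
    values.append(depth.ravel())

points = np.vstack(samples)
depths = np.concatenate(values)

hi_y, hi_x = np.mgrid[0:low_res:0.5, 0:low_res:0.5]          # 2x finer grid
high_res = griddata(points, depths, (hi_x, hi_y), method="linear")
print("super-resolved depth map shape:", high_res.shape)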
@mastersthesis{diva2:1149382,
author = {Zins, Matthieu},
title = {{Color Fusion and Super-Resolution for Time-of-Flight Cameras}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5089--SE}},
year = {2017},
address = {Sweden},
}
Extracting foreground objects from an image is a hot research topic. Doing this for high-quality real-world images in real time on limited hardware, such as a smartphone, is a demanding task. This master thesis shows how this problem can be addressed using Otsu’s method together with Gaussian probability distributions to create classifiers in different colour channels. We also show how classifiers can be combined, resulting in higher accuracy than using only the individual classifiers. We also propose using inter-class variance together with image variance to estimate classifier quality. A data set was produced to evaluate performance. The data set features real-world images captured by a smartphone and objects of varying complexity against plain backgrounds that can be found in a typical office or urban space.
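A minimal sketch of the per-channel thresholding idea with OpenCV is given below, using a synthetic stand-in image and a simple majority-vote combination; the thesis additionally fits Gaussian models and weighs the channel classifiers by estimated quality:

# Per-channel Otsu thresholding with a simple majority-vote combination
import cv2
import numpy as np

img = (np.random.rand(120, 160, 3) * 255).astype(np.uint8)   # stand-in for a real photo
channels = cv2.split(img)

masks = []
for ch in channels:
    _, mask = cv2.threshold(ch, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    masks.append(mask > 0)

# Foreground if at least two of the three channel classifiers agree
foreground = np.sum(masks, axis=0) >= 2
cv2.imwrite("foreground_mask.png", foreground.astype(np.uint8) * 255)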
@mastersthesis{diva2:1144357,
author = {Poole, Alexander},
title = {{Real-Time Image Segmentation for Augmented Reality by Combining Multi-Channel Thresholds}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5083--SE}},
year = {2017},
address = {Sweden},
}
The objective of this master’s thesis work is to evaluate the potential benefit of a superpixel preprocessing step for general object detection in a traffic environment. The various effects of different superpixel parameters on object detection performance, as well as the benefit of including depth information when generating the superpixels are investigated.
In this work, three superpixel algorithms are implemented and compared, including a proposal for an improved version of the popular Simple Linear Iterative Clustering superpixel algorithm (SLIC). The proposed improved algorithm utilises a coarse-to-fine approach which outperforms the original SLIC for high-resolution images. An object detection algorithm is also implemented and evaluated. The algorithm makes use of depth information obtained by a stereo camera to extract superpixels corresponding to foreground objects in the image. Hierarchical clustering is then applied, with the segments formed by the clustered superpixels indicating potential objects in the input image.
The object detection algorithm managed to detect on average 58% of the objects present in the chosen dataset. It performed especially well for detecting pedestrians or other objects close to the car. Altering the density distribution of the superpixels in the image yielded an increase in detection rate, and could be achieved both with or without utilising depth information. It was also shown that the use of superpixels greatly reduces the amount of computations needed for the algorithm, indicating that a real-time implementation is feasible.
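For reference, producing such a superpixel over-segmentation with scikit-image's SLIC implementation is straightforward; the thesis implements its own coarse-to-fine variant, and the image and parameters below are only illustrative:

# Quick superpixel preprocessing with scikit-image's SLIC
import numpy as np
from skimage.segmentation import slic
from skimage.data import astronaut

image = astronaut()                                   # stand-in for a traffic frame
segments = slic(image, n_segments=800, compactness=10, start_label=0)

print("number of superpixels:", segments.max() + 1)
# Mean color per superpixel, a typical per-segment feature for later clustering
means = np.array([image[segments == s].mean(axis=0) for s in range(segments.max() + 1)])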
@mastersthesis{diva2:1141088,
author = {Wälivaara, Marcus},
title = {{General Object Detection Using Superpixel Preprocessing}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5085--SE}},
year = {2017},
address = {Sweden},
}
The two main bottlenecks using deep neural networks are data dependency and training time. This thesis proposes a novel method for weight initialization of the convolutional layers in a convolutional neural network. This thesis introduces the usage of sparse dictionaries. A sparse dictionary optimized on domain specific data can be seen as a set of intelligent feature extracting filters. This thesis investigates the effect of using such filters as kernels in the convolutional layers in the neural network. How do they affect the training time and final performance?
The dataset used here is the Cityscapes dataset, a library of 25000 labeled road scene images. The sparse dictionary was acquired using the K-SVD method. The filters were added to two different networks whose performance was tested individually, one architecture being much deeper than the other. The results are presented for both networks and show that filter initialization is an important aspect which should be taken into consideration when training deep networks for semantic segmentation.
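A sketch of the idea is given below, using scikit-learn's MiniBatchDictionaryLearning in place of K-SVD (which scikit-learn does not provide): sparse dictionary atoms are learned from image patches and reshaped into a bank of convolution kernels that could initialize a first convolutional layer; the image, patch size and atom count are assumptions:

# Learn a sparse filter bank from image patches and reshape it into conv kernels
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

image = np.random.rand(256, 256)                    # stand-in for a road-scene image
patches = extract_patches_2d(image, (3, 3), max_patches=5000)
patches = patches.reshape(len(patches), -1)
patches -= patches.mean(axis=1, keepdims=True)      # remove the DC component per patch

dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, batch_size=256)
dico.fit(patches)

filters = dico.components_.reshape(64, 3, 3)        # 64 kernels for a first conv layer
print("learned filter bank shape:", filters.shape)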
@mastersthesis{diva2:1127291,
author = {Andersson, Viktor},
title = {{Semantic Segmentation:
Using Convolutional Neural Networks and Sparse dictionaries}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5054--SE}},
year = {2017},
address = {Sweden},
}
All dairy cows in Europe wear unique identification tags in their ears. These ear tags are standardized and contain the cow's identification number, today only used for visual identification by the farmer. The cow also needs to be identified by an automatic identification system connected to milk machines and other robotics used at the farm. Currently this is solved with a non-standardized radio transmitter which can be placed in different positions on the cow, and different receivers need to be used on different farms. Other drawbacks of the currently used identification system are that it is expensive and unreliable. This thesis explores the possibility to replace this non-standardized radio frequency based identification system with a standardized computer vision based system. The method proposed in this thesis uses a color threshold approach for detection, a flood fill approach followed by a Hough transform and a projection method for segmentation, and evaluates template matching, k-nearest neighbour and support vector machines as optical character recognition methods. The results from the thesis show that the quality of the data used as input to the system is vital. With good data, k-nearest neighbour, which showed the best results of the three OCR approaches, correctly handles 98% of the digits.
@mastersthesis{diva2:1120668,
author = {Ilestrand, Maja},
title = {{Automatic Eartag Recognition on Dairy Cows in Real Barn Environment}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5072--SE}},
year = {2017},
address = {Sweden},
}
Image registration is the process of geometrically deforming a template image into a reference image. This technique is important and widely used within the field of medical IT. The purpose could be to detect image variations, pathological development or, in the company AMRA’s case, to quantify fat tissue in various parts of the human body. From an MRI (Magnetic Resonance Imaging) scan, a water and fat tissue image is obtained. Currently, AMRA is using the Morphon algorithm to register and segment the water image in order to quantify fat and muscle tissue. During the first part of this master thesis, two alternative registration methods were evaluated. The first algorithm was Free Form Deformation, which is a non-linear parametric based method. The second algorithm was a non-parametric optical flow based method known as the Demon algorithm. During the second part of the thesis, the Demon algorithm was used to evaluate the effect of using the fat images for registrations.
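A hedged sketch of a Demons-style deformable registration with SimpleITK is shown below, only to illustrate the kind of algorithm evaluated; the synthetic images, iteration count and smoothing are placeholders, and this is not AMRA's pipeline:

# Demons-style deformable registration sketch with SimpleITK on synthetic 2D images
import numpy as np
import SimpleITK as sitk

# Synthetic stand-ins for the fixed (reference) and moving (template) water images
yy, xx = np.mgrid[0:64, 0:64].astype(np.float32)
fixed_arr = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 200.0)
moving_arr = np.exp(-((xx - 36) ** 2 + (yy - 30) ** 2) / 200.0)   # slightly displaced blob
fixed = sitk.GetImageFromArray(fixed_arr)
moving = sitk.GetImageFromArray(moving_arr)

demons = sitk.DemonsRegistrationFilter()
demons.SetNumberOfIterations(200)
demons.SetStandardDeviations(1.5)            # Gaussian smoothing of the displacement field

displacement = sitk.Cast(demons.Execute(fixed, moving), sitk.sitkVectorFloat64)
warped = sitk.Resample(moving, fixed, sitk.DisplacementFieldTransform(displacement))
print("mean absolute error after registration:",
      np.abs(sitk.GetArrayFromImage(warped) - fixed_arr).mean())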
@mastersthesis{diva2:1118172,
author = {Ivarsson, Magnus},
title = {{Evaluation of 3D MRI Image Registration Methods}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5037--SE}},
year = {2017},
address = {Sweden},
}
Modern cars are often equipped with sensors like radar, infrared cameras and stereo cameras that collect information about its surroundings. By using a stereo camera, it is possible to receive information about the distance to points in front of the car. This information can be used to estimate the height of the predicted path of the car. An application which does this is the stereo based Road surface preview (RSP) algorithm. By using the output from the RSP algorithm it is possible to use active suspension control, which controls the vertical movement of the wheels relative to the chassis. This application primarily makes the driving experience more comfortable, but also extends the durability of the vehicle. The idea behind this Master’s thesis is to create an evaluation tool for the RSP algorithm, which can be used at arbitrary roads.
The thesis describes the proposed evaluation tool, where focus has been to make an accurate comparison of camera data received from the RSP algorithm and laser data used as ground truth in this thesis. Since the tool shall be used at the company proposing this thesis, focus has also been on making the tool user friendly. The report discusses the proposed methods, possible sources to errors and improvements. The evaluation tool considered in this thesis shows good results for the available test data, which made it possible to include an investigation of a possible improvement of the RSP algorithm.
@mastersthesis{diva2:1115333,
author = {Manfredsson, Johan},
title = {{Evaluation Tool for a Road Surface Algorithm}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5063--SE}},
year = {2017},
address = {Sweden},
}
Deep learning has dominated the computer vision field since 2012, but a common criticism of deep learning methods is their dependence on large amounts of data. To combat this criticism, research into data-efficient deep learning is growing. The foremost success in data-efficient deep learning is transfer learning with networks pre-trained on the ImageNet dataset. Pre-trained networks have achieved state-of-the-art performance on many tasks. We consider the pre-trained network method for a new task where we have to collect the data. We hypothesize that the data efficiency of pre-trained networks can be improved through informed data collection. After exhaustive experiments on CaffeNet and VGG16, we conclude that the data efficiency indeed can be improved. Furthermore, we investigate an alternative approach to data-efficient learning, namely adding domain knowledge in the form of a spatial transformer to the pre-trained networks. We find that spatial transformers are difficult to train and seem not to improve data efficiency.
@mastersthesis{diva2:1112122,
author = {Lundström, Dennis},
title = {{Data-efficient Transfer Learning with Pre-trained Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5051--SE}},
year = {2017},
address = {Sweden},
}
Visual Object Tracking is the computer vision problem of estimating a target trajectory in a video given only its initial state. A visual tracker often acts as a component in the intelligent vision systems seen in, for instance, surveillance, autonomous vehicles or robots, and unmanned aerial vehicles. Applications may require robust tracking performance on difficult sequences depicting targets undergoing large changes in appearance, while enforcing a real-time constraint. Discriminative correlation filters have shown promising tracking performance in recent years, and have consistently improved the state of the art. With the advent of deep learning, new robust deep features have improved tracking performance considerably. However, methods based on discriminative correlation filters learn a rigid template describing the target appearance. This implies an assumption of target rigidity which is not fulfilled in practice. This thesis introduces an approach which integrates deformability into a state-of-the-art tracker. The approach is thoroughly tested on three challenging visual tracking benchmarks, achieving state-of-the-art performance.
@mastersthesis{diva2:1111930,
author = {Johnander, Joakim},
title = {{Visual Tracking with Deformable Continuous Convolution Operators}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5047--SE}},
year = {2017},
address = {Sweden},
}
There is a need for quantitative CT data in radiation therapy. Currently there are only a few algorithms that address this issue, for instance the commercial DirectDensity algorithm. In the scientific literature, an example of such an algorithm is DIRA. DIRA is an iterative model-based reconstruction method for dual-energy CT whose goal is to determine the material composition of the patient from accurate linear attenuation coefficients (LACs). It had been implemented in a two-dimensional geometry, i.e., it could process axial scans only, and there was a need to extend DIRA so that it could process projection data generated in helical scanning geometries. The newly developed algorithm (DIRA-3D) implemented (i) polyenergetic semi-parallel projection generation, (ii) mono-energetic parallel projection generation and (iii) the PI-method for image reconstruction. The computation experiments showed that the accuracies of the resulting LACs and mass fractions were comparable to those of the original DIRA. The results converged after 10 iterations.
@mastersthesis{diva2:1111894,
author = {Björnfot, Magnus},
title = {{Extension of DIRA (Dual-Energy Iterative Algorithm) to 3D Helical CT}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5057--SE}},
year = {2017},
address = {Sweden},
}
This work investigates the landscape of aerial image stereo matching (AISM) methods suitable for large scale forest variable estimation. AISM methods are an important source of remotely collected information used in modern forestry to keep track of a growing forest's condition.
A total of 17 AISM methods are investigated, out of which 4 are evaluated by processing a test data set consisting of three aerial images. The test area is located in southern Sweden and consists mainly of Norway spruce and Scots pine. From the resulting point clouds and height raster images, a total of 30 different metrics of both height and density types are derived. Linear regression is used to fit functions from metrics derived from AISM data to a set of forest variables including tree height (HBW), tree diameter (DBW), basal area and volume. As ground truth, data collected by dense airborne laser scanning is used. Results are presented as RMSE and standard deviation concluded from the linear regression.
For tree height, tree diameter, basal area and volume, the RMSE ranged from 7.442% to 10.11%, 11.58% to 13.96%, 32.01% to 35.10% and 34.01% to 38.26%, respectively. The results show that all four tested methods achieved comparable estimation quality, with only small differences among them. Keystone and SURE performed somewhat better, while MicMac placed third and Photoscan achieved the least accurate result.
@mastersthesis{diva2:1109735,
author = {Svensk, Joakim},
title = {{Evaluation of Aerial Image Stereo Matching Methods for Forest Variable Estimation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5036--SE}},
year = {2017},
address = {Sweden},
}
Now and then train accidents occur. Collisions between trains and objects such as animals, humans, cars, and fallen trees can result in casualties, severe damage on the train, and delays in the train traffic. Thus, train collisions are a considerable problem with consequences affecting society substantially.
The company Termisk Systemteknik AB has, on commission by Rindi Solutions AB, investigated the possibility to detect anomalies on the railway using a train-mounted thermal imaging camera. Rails are also detected in order to determine if an anomaly is on the rail or not. However, the rail detection method does not work satisfactorily at long range.
The purpose of this master’s thesis is to improve the previous rail detector at long range by using machine learning, and in particular deep learning and a convolutional neural network. Of interest is also to investigate if there are any advantages using cross-modal transfer learning.
A labelled dataset for training and testing was produced manually. Also, a loss function tailored to the particular problem at hand was constructed. The loss function was used both for improving the system during training and evaluate the system’s performance during testing. Finally, eight different approaches were evaluated, each one resulting in a different rail detector.
Several of the rail detectors, and in particular all the rail detectors using cross-modal transfer learning, perform better than the previous rail detector. Thus, the new rail detectors show great potential for the rail detection problem.
@mastersthesis{diva2:1111486,
author = {Wedberg, Magnus},
title = {{Detecting Rails in Images from a Train-Mounted Thermal Camera Using a Convolutional Neural Network}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5058--SE}},
year = {2017},
address = {Sweden},
}
Automated navigability assessment based on image sensor data is an important concern in the design of autonomous robotic systems. The problem consists in finding a mapping from input data to the navigability status of different areas of the surrounding world. Machine learning techniques are often applied to this problem. This thesis investigates an approach to navigability assessment in the image plane, based on offline learning using deep convolutional neural networks, applied to RGB and depth data collected using a robotic platform. Training outputs were generated by manually marking out instances of near collision in the sequences and tracing back the location of the near-collision frame through the previous frames. Several combinations of network inputs were tried out. Inputs included grayscale gradient versions of the RGB frames, depth maps, image coordinate maps and motion information in the form of a previous RGB frame or heading maps. Some improvement compared to simple depth thresholding was demonstrated, mainly in the handling of noise and missing pixels in the depth maps. The resulting networks appear to be mostly dependent on depth information; an attempt to train a network without the depth frames was unsuccessful, and a network trained using the depth frames alone performed similarly to networks trained with additional inputs. An unsuccessful attempt at training a network towards a more motion-dependent navigability concept was also made. It was done by including training frames captured as the robot was moving away from the obstacle, where the corresponding training outputs were marked as obstacle-free.
@mastersthesis{diva2:1110839,
author = {Wimby Schmidt, Ebba},
title = {{Navigability Assessment for Autonomous Systems Using Deep Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5045--SE}},
year = {2017},
address = {Sweden},
}
This thesis investigates if support vector machine classification is a suitable approach when performing automatic segmentation of knee cartilage using quantitative magnetic resonance imaging data. The data sets used are part of a clinical project that investigates if patients that have suffered recent knee damage will develop cartilage damage. Therefore the thesis also investigates if the segmentation results can be used to predict the clinical outcome of the patients.
Two methods that perform the segmentation using support vector machine classification are implemented and evaluated. The evaluation indicates that it is a good approach for the task, but the implemented methods need to be further improved and tested on more data sets before clinical use.
It was not possible to relate the cartilage properties to clinical outcome using the segmentation results. However, the investigation demonstrated good promise of how the segmentation results, if they are improved, can be used in combination with quantitative magnetic resonance imaging data to analyze how the cartilage properties change over time or vary between knees.
@mastersthesis{diva2:1109911,
author = {Lind, Marcus},
title = {{Automatic Segmentation of Knee Cartilage Using Quantitative MRI Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5041--SE}},
year = {2017},
address = {Sweden},
}
When forensic examiners try to identify the perpetrator of a felony, they use individual facial marks when comparing the suspect with the perpetrator. Facial marks are often used for identification and they are nowadays found manually. To speed up this process, it is desired to detect interesting facial marks automatically. This master thesis describes a method to automatically detect and separate permanent and non-permanent marks. It uses a fast radial symmetry algorithm as a core element in the mark detector. After candidate skin mark extraction, the false detections are removed depending on their size, shape and number of hair pixels. The classification of the skin marks is done with a support vector machine and the different features are examined. The results show that the facial mark detector has a good recall while the precision is poor. The elimination methods of false detection were analysed as well as the different features for the classifier. One can conclude that the color of facial marks is more relevant than the structure when classifying them into permanent and non-permanent marks.
@mastersthesis{diva2:1107743,
author = {Moulis, Armand},
title = {{Automatic Detection and Classification of Permanent and Non-Permanent Skin Marks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5048--SE}},
year = {2017},
address = {Sweden},
}
After a digital photo has been taken by a camera, it can be manipulated to be more appealing. Two ways of doing that are to reduce noise and to increase the saturation. With time and skills in an image manipulating program, this is usually done by hand. In this thesis, automatic image improvement based on artificial neural networks is explored and evaluated qualitatively and quantitatively. A new approach, which builds on an existing method for colorizing gray scale images is presented and its performance compared both to simpler methods and the state of the art in image denoising. Saturation is lowered and noise added to original images, which the methods receive as inputs to improve upon. The new method is shown to improve in some cases but not all, depending on the image and how it was modified before given to the method.
@mastersthesis{diva2:1098332,
author = {Lind, Benjamin},
title = {{Artificial Neural Networks for Image Improvement}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5025--SE}},
year = {2017},
address = {Sweden},
}
In computer vision, it has in recent years become more popular to use point clouds to represent 3D data. To understand what a point cloud contains, methods like semantic segmentation can be used. Semantic segmentation is the problem of segmenting images or point clouds and understanding what the different segments are. One application of semantic segmentation of point clouds is autonomous driving, where the car needs information about objects in its surroundings.
Our approach to the problem is to project the point clouds into 2D virtual images using the Katz projection. Then we use pre-trained convolutional neural networks to semantically segment the images. To get the semantically segmented point clouds, we project the scores from the segmentation back into the point cloud. Our approach is evaluated on the Semantic3D dataset. We find our method is comparable to the state of the art, without any fine-tuning on the Semantic3D dataset.
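A rough sketch of the visibility step behind such a projection is shown below, using Open3D's hidden-point-removal operator (based on the Katz et al. method); the synthetic cloud, camera placement and radius are illustrative assumptions, and the rendering of virtual images and back-projection of scores are omitted:

# Visibility filtering of a point cloud from a virtual camera with Open3D
import numpy as np
import open3d as o3d

pts = np.random.rand(2000, 3)                                   # stand-in point cloud
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))

camera = [0.0, 0.0, 2.0]                                        # virtual camera position
diameter = np.linalg.norm(pcd.get_max_bound() - pcd.get_min_bound())

_, visible_idx = pcd.hidden_point_removal(camera, radius=diameter * 100)
visible = pcd.select_by_index(visible_idx)
print(f"{len(visible_idx)} of {len(pcd.points)} points visible from the virtual camera")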
@mastersthesis{diva2:1091059,
author = {Tosteberg, Patrik},
title = {{Semantic Segmentation of Point Clouds Using Deep Learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--17/5029--SE}},
year = {2017},
address = {Sweden},
}
In many situations after a big catastrophe such as the one in Fukushima, the disaster area is highly dangerous for humans to enter. It is in such environments that a semi-autonomous robot could limit the risks to humans by exploring and mapping the area on its own. This thesis intends to design and implement a software based SLAM system which has potential to run in real-time using a Kinect 2 sensor as input.
The focus of the thesis has been to create a system which allows for efficient storage and representation of the map, in order to be able to explore large environments. This is done by separating the map in different abstraction levels corresponding to local maps connected by a global map.
During the implementation, this structure has been kept in mind in order to allow modularity. This makes it possible for each sub-component in the system to be exchanged if needed.
The thesis is broad in the sense that it uses techniques from distinct areas to solve the sub-problems that exist. Some examples being, object detection and classification, point-cloud registration and efficient 3D-based occupancy trees.
@mastersthesis{diva2:1065996,
author = {Holmquist, Karl},
title = {{SLAMIt A Sub-Map Based SLAM System:
On-line creation of multi-leveled map}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/5021--SE}},
year = {2017},
address = {Sweden},
}
Generic visual tracking is a challenging computer vision problem, where the position of a specified target is estimated through a sequence of frames. The only given information is the initial location of the target. Therefore, the tracker has to adapt and learn any kind of object, which it describes through visual features used to differentiate target from background. Standard appearance features only capture momentary visual information. This master’s thesis investigates the use of deep features extracted through optical flow images processed in a deep convolutional network. The optical flow is calculated using two consecutive images, and thereby captures the dynamic nature of the scene. Results show that this information is complementary to the standard appearance features, and improves performance of the tracker. Deep features are typically very high dimensional. Employing dimensionality reduction can increase both the efficiency and performance of the tracker. As a second aim in this thesis, PCA and PLS were evaluated and compared. The evaluations show that the two methods are almost equal in performance, with PLS actually receiving slightly better score than the popular PCA. The final proposed tracker was evaluated on three challenging datasets, and was shown to outperform other state-of-the-art trackers.
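To illustrate the two ingredients discussed above, the sketch below computes a dense optical flow field from two consecutive frames with OpenCV's Farnebäck method and compresses a set of (here random) high-dimensional deep features with PCA; it does not reproduce the tracker itself, and all data are synthetic stand-ins:

# Dense optical flow between consecutive frames plus PCA compression of deep features
import cv2
import numpy as np
from sklearn.decomposition import PCA

prev = (np.random.rand(240, 320) * 255).astype(np.uint8)   # stand-in grayscale frame
curr = np.roll(prev, 2, axis=1)                             # simulated horizontal motion
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)   # H x W x 2 motion field

# Suppose deep features were extracted from flow images: N samples x 4096 dimensions
deep_features = np.random.rand(200, 4096)
pca = PCA(n_components=64)
compressed = pca.fit_transform(deep_features)
print("flow field shape:", flow.shape, "| compressed feature shape:", compressed.shape)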
@mastersthesis{diva2:1071737,
author = {Gladh, Susanna},
title = {{Visual Tracking Using Deep Motion Features}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/5005--SE}},
year = {2016},
address = {Sweden},
}
This thesis presents a way to generate a Digital Terrain Model (DTM) from a Digital Surface Model (DSM) and multispectral images (including the near-infrared (NIR) color band). An artificial neural network (ANN) is used to pre-classify the DSM and multispectral images, which in turn is used to filter the DSM into a DTM. Using an ANN as a classifier provided good results, and adding the NIR color band further improved the accuracy of the classifier. Using the classifier, a DTM was easily extracted without removing natural edges or height variations in forests and cities, challenges that are handled considerably better than by earlier methods.
@mastersthesis{diva2:1058430,
author = {Tapper, Gustav},
title = {{Extraction of DTM from Satellite Images Using Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/5017--SE}},
year = {2016},
address = {Sweden},
}
Since most people now have a high-performing computing device with an attached camera in their pocket, in the form of a smartphone, robotics and computer vision researchers are thrilled about the possibility this creates. Such devices have previously been used in robotics to create 3D maps of environments and objects by feeding the camera data to a 3D reconstruction algorithm.
The big downside with smartphones is that their cameras use a different sensor than what is usually used in robotics, namely a rolling shutter camera. These cameras are cheaper to produce but are not as well suited for general 3D reconstruction algorithms as the global shutter cameras typically used in robotics research. One recent, accurate and computationally efficient 3D reconstruction method which could be used on a mobile device, if tweaked, is LSD-SLAM.
This thesis uses the LSD-SLAM method developed for global shutter cameras and incorporates additional methods developed to allow the use of rolling shutter data. The developed method is evaluated by counting the number of failed 3D reconstructions before a successful one is obtained when using rolling shutter data. The result is a method which improves this metric by about 70% compared to the unmodified LSD-SLAM method.
@mastersthesis{diva2:1058367,
author = {Tallund, Lukas},
title = {{Handling of Rolling Shutter Effects in Monocular Semi-Dense SLAM Algorithms}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/5016--SE}},
year = {2016},
address = {Sweden},
}
This master's thesis presents an approach to track and count the number of fruit in commercial mango orchards. The algorithm is intended to enable precision agriculture and to facilitate labour and post-harvest storage planning. The primary objective is to develop a multi-view algorithm and investigate how it can be used to mitigate the effects of visual occlusion, to improve upon estimates from methods that use a single central or two opposite viewpoints. Fruit are detected in images using two classification methods: dense pixel-wise CNN and region-based R-CNN detection. Pair-wise fruit correspondences are established between images by using geometry provided by navigation data, and lidar data is used to generate image masks for each separate tree, to isolate fruit counts to individual trees. The tracked fruit are triangulated to locate them in 3D space, and spatial statistics are calculated over whole orchard blocks. The estimated tree counts are compared to single-view estimates and validated against ground truth data of 16 mango trees from a Bundaberg mango orchard in Queensland, Australia. The results show a high R2 value of 0.99335 for four hand-labelled trees and a highest R2 value of 0.9165 for the machine-labelled images using the R-CNN classifier for the 16 target trees.
@mastersthesis{diva2:1045302,
author = {Stein, Madeleine},
title = {{Improving Image Based Fruitcount Estimates Using Multiple View-Points}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/5003--SE}},
year = {2016},
address = {Sweden},
}
Cardiovascular diseases are among the most common causes of death worldwide. One recently developed flow analysis technique, 4D flow magnetic resonance imaging (MRI), allows an early detection of such diseases. Due to the limited resolution and contrast between blood pool and myocardium of 4D flow images, cine MR images are often used for cardiac segmentation. The delineated structures are then transferred to the 4D Flow images for cardiovascular flow analysis. Cine MR images are however acquired with multiple breath-holds, which can be challenging for some people, especially when a cardiovascular disease is present. Consequently, unexpected breathing motion by a patient may lead to misalignments between the acquired cine MR images.
The goal of the thesis is to test the feasibility of an automatic image registration method to correct the misalignment caused by respiratory motion in morphological 2D cine MR images by using the 4D Flow MR as the reference image. As a registration method relies on a set of optimal parameters to provide desired results, a comprehensive investigation was performed to find such parameters. Different combinations of registration parameter settings were applied on 20 datasets from both healthy volunteers and patients. The best combinations, selected on the basis of normalized cross-correlation, were evaluated using the clinical gold-standard by employing widely used geometric measures of spatial correspondence. The accuracy of the best parameters from the geometric evaluation was finally validated by using simulated misalignments.
Using a registration method consisting of only translation improved the results both for the datasets from healthy volunteers and patients and for the simulated misalignment data. For the datasets from healthy volunteers and patients, the registration improved the results from 0.7074 ± 0.1644 to 0.7551 ± 0.0737 in Dice index and from 1.8818 ± 0.9269 to 1.5953 ± 0.5192 for point-to-curve error. These values are mean values over all 20 datasets.
The results from geometric evaluation on the data from both healthy volunteers and patients show that the developed correction method is able to improve the alignment of the cine MR images. This allows a reliable segmentation of 4D flow MR images for cardiac flow assessment.
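For reference, a minimal sketch of the two quantities used above, normalized cross-correlation (the basis for parameter selection) and the Dice index (the geometric evaluation measure), together with an exhaustive integer translation search; this is illustrative only and not the registration framework used in the thesis:

import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized images."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float((a * b).mean())

def dice(mask_a, mask_b):
    """Dice index between two binary segmentation masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * inter / (mask_a.sum() + mask_b.sum() + 1e-12)

def best_translation(moving, reference, max_shift=10):
    """Exhaustively search integer (dy, dx) shifts and keep the one maximizing NCC."""
    best, best_shift = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(moving, dy, axis=0), dx, axis=1)
            score = ncc(shifted, reference)
            if score > best:
                best, best_shift = score, (dy, dx)
    return best_shift, best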
@mastersthesis{diva2:972664,
author = {Härd, Victoria},
title = {{Automatic Alignment of 2D Cine Morphological Images Using 4D Flow MRI Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4992--SE}},
year = {2016},
address = {Sweden},
}
Object Recognition is the art of localizing predefined objects in image sensor data. In this thesis a depth sensor was used which has the benefit that the 3D pose of the object can be estimated. This has applications in e.g. automatic manufacturing, where a robot picks up parts or tools with a robot arm.
This master thesis presents an implementation and an evaluation of a system for object recognition of 3D models in depth sensor data. The system uses several depth images rendered from a 3D model and describes their characteristics using so-called feature descriptors. These are then matched with the descriptors of a scene depth image to find the 3D pose of the model in the scene. The pose estimate is then refined iteratively using a registration method. Different descriptors and registration methods are investigated.
One of the main contributions of this thesis is that it compares two different types of descriptors, local and global, which has seen little attention in research. This is done for two different scene scenarios, and for different types of objects and depth sensors. The evaluation shows that global descriptors are fast and robust for objects with a smooth visible surface, whereas the local descriptors perform better for larger objects in clutter and occlusion. This thesis also presents a novel global descriptor, the CESF, which is observed to be more robust than other global descriptors. As for the registration methods, standard ICP is shown to be the most accurate, while point-to-plane ICP is the more robust.
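A minimal sketch of the point-to-point ICP refinement step mentioned above, assuming (N, 3) NumPy point arrays and nearest-neighbour correspondences from a k-d tree; this is a generic textbook formulation, not the thesis implementation:

import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(src, dst, iters=30):
    """Align src to dst; returns the accumulated rotation R and translation t."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(dst)
    cur = src.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)                 # closest-point correspondences
        matched = dst[idx]
        mu_s, mu_d = cur.mean(0), matched.mean(0)
        H = (cur - mu_s).T @ (matched - mu_d)    # cross-covariance of centred point sets
        U, _, Vt = np.linalg.svd(H)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:            # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_d - R_step @ mu_s
        cur = (R_step @ cur.T).T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t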
@mastersthesis{diva2:972438,
author = {Grankvist, Ola},
title = {{Recognition and Registration of 3D Models in Depth Sensor Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4993--SE}},
year = {2016},
address = {Sweden},
}
Automatic stack measurement is a measurement system that measures the volume of wood on timber trucks. The system consists of six sensor systems. Each sensor is first calibrated individually and then jointly, to give a common world coordinate system. Each sensor generates a depth image and a reflectance image, where the values in the depth image represent the distance from the camera. The client has developed an algorithm that estimates the wood volume from the measurement data (the images) to an accuracy that fulfils the requirements set by the forest industry for automatic measurement of stacks on timber trucks. This report investigates whether better measurement results can be achieved, for example with other methods or combinations of them. About 125 datasets of stacks with ground truth are available, where the ground truth consists of manual sample measurements in which each individual log was measured separately. Initially, a deliberate choice was made not to study the client's algorithm, in order not to be biased by how their results were obtained. Primarily, the front and back images of a stack are used to find the logs. The found logs are then interpolated towards the middle of the stack, or the logs are paired up from the two sides. Sometimes there are problems with the images; most often at least one of the sides is occluded by the truck cab, the crane or another stack. In those cases an estimate must be made from the visible data in order to fill in the occluded regions. At the beginning of the thesis work, two methods (MSER and the point-plane method) were used to investigate whether good results could be achieved by measuring the data directly and using it as an initial guess of the volume. However, it was discovered that valuable details in the datasets, such as the distribution of diameters of the found log ends, were missed and are needed to determine the wood volume more precisely. The volume also tended to be heavily overestimated when the stacks contained a certain amount of branches or poorly delimbed logs. A geometric method was therefore constructed, and this is the method on which most time was spent. The figures below show a table and a graph with the results under bark (UB) of all three methods, together with the interval limits for fulfilling the requirements set by the forest industry.
@mastersthesis{diva2:968712,
author = {Lindberg, Pontus},
title = {{Automatisk volymmätning av virkestravar på lastbil}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4955--SE}},
year = {2016},
address = {Sweden},
}
Detection and positioning of anatomical landmarks, also called points of interest (POI), is often a concept of interest in medical image processing. Different measures or automatic image analyses are often directly based upon positions of such points, e.g. in organ segmentation or tissue quantification. Manual positioning of these landmarks is a time consuming and resource demanding process. In this thesis, a general method for positioning of anatomical landmarks is outlined, implemented and evaluated. The evaluation of the method is limited to three different POI: left femur head, right femur head and vertebra T9. These POI are used to define the range of the abdomen in order to measure the amount of abdominal fat in 3D data acquired with quantitative magnetic resonance imaging (MRI). By getting more detailed information about the abdominal body fat composition, medical diagnoses can be issued with higher confidence. Examples of applications could be identifying patients with high risk of developing metabolic or catabolic disease and characterizing the effects of different interventions, e.g. training, bariatric surgery and medications. The proposed method is shown to be highly robust and accurate for positioning of the left and right femur head. Due to insufficient performance regarding T9 detection, a modified method is proposed for T9 positioning. The modified method shows promise of accurate and repeatable results but has to be evaluated more extensively in order to draw further conclusions.
@mastersthesis{diva2:957048,
author = {Järrendahl, Hannes},
title = {{Automatic Detection of Anatomical Landmarks in Three-Dimensional MRI}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4990--SE}},
year = {2016},
address = {Sweden},
}
Simultaneous localization and mapping (SLAM) is the problem of mapping your surroundings while simultaneously localizing yourself in the map. It is an important and active area of research for robotics. In this master thesis two approaches are attempted to reduce the drift which appears over time in SLAM algorithms. The first approach tries three different motion models for the camera. Two of the models exploit the a priori knowledge that the camera is mounted on a trolley, and these two are shown to improve the results. The second approach attempts to reduce the drift by reducing noise in the point cloud data used for mapping. This is done by finding planar surfaces in the point clouds. Median filtering is used as an alternative to compare the result of the noise reduction. The plane estimation approach is also shown to reduce the drift, while the median filtering makes it worse.
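As an illustrative sketch of the plane-based noise reduction idea (assuming the points belonging to one detected planar surface have already been grouped; not the thesis code), a least-squares plane can be fitted with an SVD and the noisy points snapped onto it:

import numpy as np

def fit_plane(points):
    """Least-squares plane fit: unit normal n and offset d such that n.x + d is approx 0."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                         # direction of least variance
    return normal, -normal @ centroid

def project_to_plane(points, normal, d):
    """Snap noisy points onto their fitted plane (the de-noising step)."""
    dist = points @ normal + d              # signed point-to-plane distances
    return points - dist[:, None] * normal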
@mastersthesis{diva2:957728,
author = {Bondemark, Richard},
title = {{Improving SLAM on a TOF Camera by Exploiting Planar Surfaces}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4984--SE}},
year = {2016},
address = {Sweden},
}
Measurements from stereo reconstruction can be obtained with high accuracy with correctly calibrated cameras. A stereo camera rig mounted in an outdoor environment is exposed to temperature changes, which have an impact on the calibration of the cameras.
The aim of the master's thesis was to investigate the thermal impact on a calibrated stereo camera rig. This was performed by placing a stereo rig in a temperature chamber and collecting data of a calibration board at different temperatures. Data was collected with two different cameras and lenses and used for calibration of the stereo camera rig in different scenarios. The obtained parameters were plotted and analyzed.
The result of the master's thesis is that thermal variation has an impact on the accuracy of the calibrated stereo camera rig. A calibration obtained at one temperature cannot be used at a different temperature without a degradation of the accuracy. The plotted parameters from the calibration had a high noise level due to problems with the calibration methods, and no visible trend from temperature changes could be seen.
@mastersthesis{diva2:941863,
author = {Andersson, Elin},
title = {{Thermal Impact of a Calibrated Stereo Camera Rig}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4980--SE}},
year = {2016},
address = {Sweden},
}
Segmentation of the brain into sub-volumes has many clinical applications. Many neurological diseases are connected with brain atrophy (tissue loss). By dividing the brain into smaller compartments, volume comparison between the compartments can be made, as well as monitoring of local volume changes over time. The former is especially interesting for the left and right cerebral hemispheres, due to their symmetric appearance. By using automatic segmentation, the time consuming step of manually labelling the brain is removed, allowing for larger scale research. In this thesis, three automatic methods for segmenting the brain from magnetic resonance (MR) images are implemented and evaluated. Since none of the evaluated methods resulted in sufficiently good segmentations to be clinically relevant, a novel segmentation method, called SB-GC (shape bottleneck detection incorporated in graph cuts), is also presented. SB-GC utilizes quantitative MRI data as input, together with shape bottleneck detection and graph cuts, to segment the brain into the left and right cerebral hemispheres, the cerebellum and the brain stem. SB-GC shows promise of highly accurate and repeatable results for both healthy, adult brains and more challenging cases such as children and brains containing pathologies.
@mastersthesis{diva2:933699,
author = {Stacke, Karin},
title = {{Automatic Brain Segmentation into Substructures Using Quantitative MRI}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4956--SE}},
year = {2016},
address = {Sweden},
}
The usage of 3D modelling is increasing fast, both for civilian and military areas, such as navigation, targeting and urban planning. When creating a 3D model from satellite images, clouds can be problematic. Thus, automatic detection of clouds in the images is of great use. This master thesis was carried out at Vricon, who produces 3D models of the earth from satellite images. The thesis aimed to investigate if Support Vector Machines could classify pixels into cloud or non-cloud, with a combination of texture and color as features. To solve the stated goal, the task was divided into several subproblems, where the first part was to extract features from the images. Then the images were preprocessed before being fed to the classifier. After that, the classifier was trained, and finally evaluated. The two methods that gave the best results in this thesis had approximately 95 % correctly classified pixels. This result is better than the existing cloud segmentation method at Vricon, for the tested terrain and cloud types.
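A hedged sketch of the classification step described above, using a scikit-learn SVM on per-pixel feature vectors; the feature layout and the synthetic training data are placeholders, not the thesis features or data:

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: one row per pixel with e.g. [R, G, B, local_stddev, texture_energy]; y: 1 = cloud, 0 = not cloud
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)   # synthetic stand-in for labelled pixels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X[:1500], y[:1500])
print("held-out accuracy:", clf.score(X[1500:], y[1500:]))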
@mastersthesis{diva2:932606,
author = {Gasslander, Maja},
title = {{Segmentation of Clouds in Satellite Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4945--SE}},
year = {2016},
address = {Sweden},
}
Face recognition is the problem of identifying individuals in images. This thesis evaluates two methods used to determine if pairs of face images belong to the same individual or not. The first method is a combination of principal component analysis and a neural network and the second method is based on state-of-the-art convolutional neural networks. They are trained and evaluated using two different data sets. The first set contains many images with large variations in, for example, illumination and facial expression. The second consists of fewer images with small variations.
Principal component analysis allowed the use of smaller networks. The largest network has 1.7 million parameters compared to the 7 million used in the convolutional network. The use of smaller networks lowered the training time and evaluation time significantly. Principal component analysis proved to be well suited for the data set with small variations, outperforming the convolutional network, which needs larger data sets to avoid overfitting. The reduction in data dimensionality, however, led to difficulties classifying the data set with large variations. The generous amount of images in this set allowed the convolutional method to reach higher accuracies than the principal component method.
@mastersthesis{diva2:931705,
author = {Habrman, David},
title = {{Face Recognition with Preprocessing and Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4953--SE}},
year = {2016},
address = {Sweden},
}
In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing the text has been to use the well known Bag-Of-Words representation. However, lately low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. While few studies have made a fair comparison of the models' sensitivity to the text representation, this thesis tries to fill that gap. We especially seek insight into the impact various unsupervised pre-trained vectors have on the performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that while low-dimensional pre-trained representations often have computational benefits and have also reported state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.
@mastersthesis{diva2:928411,
author = {Norlund, Tobias},
title = {{The Use of Distributional Semantics in Text Classification Models:
Comparative performance analysis of popular word embeddings}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4926--SE}},
year = {2016},
address = {Sweden},
}
The art of reconstructing a real-world scene digitally has been on the minds of researchers for decades. Recently, it has attracted more and more attention from companies seeing a chance to bring this kind of technology to the market. Digital reconstruction of buildings in particular is a niche that has both potential and room for improvement. With this background, this thesis presents the design and evaluation of a pipeline made to find and correct approximately flat surfaces in architectural scenes. The scenes are 3D-reconstructed triangle meshes based on RGB images. The thesis also comprises an evaluation of a few different components available for doing this, leading to a choice of the best components. The goal is to improve the visual quality of the reconstruction.
The final pipeline is designed with two blocks - one to detect initial plane seeds and one to refine the detected planes. The first block makes use of a multi-label energy formulation on the graph that describes the reconstructed surface. Penalties are assigned to each vertex and each edge of the graph based on the vertex labels, effectively describing a Markov Random Field. The energy is minimized with the help of the alpha-expansion algorithm. The second block uses heuristics for growing the detected plane seeds, merging similar planes together and extracting deviating details.
Results on several scenes are presented, showing that the visual quality has been improved while maintaining accuracy compared with ground truth data.
@mastersthesis{diva2:917230,
author = {Jonsson, Mikael},
title = {{Make it Flat:
Detection and Correction of Planar Regions in Triangle Meshes}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4930--SE}},
year = {2016},
address = {Sweden},
}
Lung cancer is the most common type of cancer in the world and always manifests as lung nodules. Nodules are small tumors that consist of lung tissue. They are usually spherical in shape and their cores can be either solid or subsolid. Nodules are common in lungs, but not all of them are malignant. To determine if a nodule is malignant or benign, attributes like nodule size and volume growth are commonly used. The procedure to obtain these attributes is time consuming, and therefore calls for tools to simplify the process.
The purpose of this thesis work was to investigate the feasibility of a semi-automatic lung nodule segmentation pipeline including volume estimation. This was done by implementing, tuning and evaluating image processing algorithms with different characteristics to create pipeline candidates. These candidates were compared using a similarity index between their segmentation results and ground truth markings to determine the most promising one.
The best performing pipeline consisted of a fixed region of interest together with a level set segmentation algorithm. Its segmentation accuracy was not consistent for all nodules evaluated, but the pipeline showed great potential when dynamically adapting its parameters for each nodule. The use of dynamic parameters was only briefly explored, and further research would be necessary to determine its feasibility.
@mastersthesis{diva2:911649,
author = {Berglin, Lukas},
title = {{Design, Evaluation and Implementation of a Pipeline for Semi-Automatic Lung Nodule Segmentation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--16/4925--SE}},
year = {2016},
address = {Sweden},
}
Generic visual tracking is one of the classical problems in computer vision. In this problem, no prior knowledge of the target is available aside from a bounding box in the initial frame of the sequence. Generic visual tracking is a difficult task due to a number of factors such as momentary occlusions, target rotations, changes in target illumination and variations in the target size. In recent years, discriminative correlation filter (DCF) based trackers have shown promising results for visual tracking. These DCF based methods use the Fourier transform to efficiently calculate detection and model updates, allowing significantly higher frame rates than competing methods. However, existing DCF based methods only estimate translation of the object while ignoring changes in size. This thesis investigates the problem of accurately estimating the scale variations within a DCF based framework. A novel scale estimation method is proposed by explicitly constructing translation and scale filters. The proposed scale estimation technique is robust and significantly improves the tracking performance, while operating in real time. In addition, a comprehensive evaluation of feature representations in a DCF framework is performed. Experiments are performed on the benchmark OTB-2015 dataset, as well as the VOT 2014 dataset. The proposed methods are shown to significantly improve the performance of existing DCF based trackers.
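For intuition, a minimal single-channel correlation filter in the Fourier domain (a MOSSE-style ridge regression), illustrating why DCF training and detection are cheap; this is a simplified sketch, not the multi-channel translation and scale filters proposed in the thesis:

import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-2):
    """Learn a filter whose correlation response to the training patch is a centred Gaussian."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    G, F = np.fft.fft2(g), np.fft.fft2(patch)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)    # closed-form ridge solution

def detect(H, patch):
    """Apply the filter to a new patch and return the (row, col) of the response peak."""
    resp = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(resp), resp.shape)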
@mastersthesis{diva2:910736,
author = {Häger, Gustav},
title = {{Improving Discriminative Correlation Filters for Visual Tracking}},
school = {Linköping University},
type = {{LiTH-ISY-EX-15/4919--SE}},
year = {2015},
address = {Sweden},
}
Cars have become increasingly intelligent throughout the years. Today's radar and vision based safety systems can warn a driver and brake the vehicle automatically if obstacles are detected. Research projects such as the Google Car have even succeeded in creating fully autonomous cars.
The demands for obtaining the highest rating in safety tests such as Euro NCAP are also steadily increasing, and as a result, the development of these systems has become more attractive for car manufacturers. In the near future, a car must have a system for detecting, and performing automatic braking for, pedestrians to receive the highest safety rating of five stars. The prospect is that the volume of active safety systems will increase drastically when car manufacturers start installing them not only in luxury cars, but also in regularly priced ones. The use of automatic braking comes with a high demand on the performance of active safety systems; false positives must be avoided at all costs.
Dollar et al. [2014] introduced Aggregated Channel Features (ACF), which is based on a 10-channel LUV+HOG feature map. The method uses decision trees learned from boosting and has been shown to outperform previous algorithms in object detection tasks. The rediscovery of neural networks, and especially Convolutional Neural Networks (CNN), has increased the performance in almost every field of machine learning, including pedestrian detection. Recently Yang et al. [2015] combined the two approaches by using the feature maps from a CNN as input to a decision tree based boosting framework. This resulted in state-of-the-art performance on the challenging Caltech pedestrian data set.
This thesis presents an approach to improve the performance of a cascade of boosted classifiers by investigating the impact of using color information for pedestrian detection. The color self-similarity feature introduced by Walk et al. [2010] was used to create a version better adapted for boosting. This feature is then used in combination with a gradient based feature at the last step of a cascade.
The presented feature increases the performance compared to currently used classifiers at Autoliv, on data recorded by Autoliv and on the benchmark Caltech pedestrian data set.
@mastersthesis{diva2:867888,
author = {Hansson, Niklas},
title = {{Color Features for Boosted Pedestrian Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4899--SE}},
year = {2015},
address = {Sweden},
}
In the steel industry, laser triangulation based measurement systems can be utilized for evaluating the flatness of steel products. Shapeline is a company in Linköping that manufactures such measurement systems. This thesis work presents a series of experiments on a Shapeline measurement system in a relatively untested environment, the hot rolling mill at SSAB in Borlänge. The purpose of this work is to evaluate how the conditions at a hot rolling mill affect the measurement performance. It was anticipated that measuring in a high temperature environment would introduce difficulties that do not exist when measuring in cold environments. A number of different experiments were conducted, where equipment such as the laser and the camera bandpass filter were alternated. Via the experiments, information about noise due to the environment in the hot rolling mill was gained. The most significant noise was caused by heat shimmering. Using the presented methods, the magnitude and frequency spectrum of the heat shimmering noise could be determined. The results also indicate that heat shimmering causes large errors and is quite troublesome to counter. In addition to this, the quality of the line detections under the hot rolling mill circumstances was examined. It could be observed that the line detections did not introduce any significant errors despite the harmful conditions.
@mastersthesis{diva2:857691,
author = {Larsson, Oliver},
title = {{Evaluation of Flatness Gauge for Hot Rolling Mills}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4894--SE}},
year = {2015},
address = {Sweden},
}
In a time when cattle herds grow continually larger, the need for automatic methods to detect diseases is ever increasing. One possible method to discover diseases is to use thermal images and automatic head and eye detectors. In this thesis an eye detector and a head detector are implemented using the Random Forests classifier. During the implementation the classifier is evaluated using three different descriptors: Histogram of Oriented Gradients, Local Binary Patterns, and a descriptor based on pixel differences. An alternative classifier, the Support Vector Machine, is also evaluated for comparison against Random Forests.
The thesis results show that Histogram of Oriented Gradients performs well as a description of cattle heads, while Local Binary Patterns performs well as a description of cattle eyes. The provided descriptor performs almost equally well in both cases. The results also show that Random Forests performs approximately as good as the Support Vector Machine, when the Support Vector Machine is paired with Local Binary Patterns for both heads and eyes.
Finally the thesis results indicate that it is easier to detect and locate cattle heads than it is to detect and locate cattle eyes. For eyes, combining a head detector and an eye detector is shown to give a better result than only using an eye detector. In this combination heads are first detected in images, followed by using the eye detector in areas classified as heads.
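As a hedged sketch of detector training of this kind (synthetic stand-in data, illustrative window size and parameters; not the thesis pipeline), HOG descriptors from scikit-image can be fed to a scikit-learn Random Forest:

import numpy as np
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

def hog_descriptor(window):
    """HOG descriptor of a fixed-size grayscale window (e.g. a 64x64 head candidate)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# windows: candidate windows cut out of thermal images; labels: 1 = head, 0 = background
rng = np.random.default_rng(0)
windows = rng.random((200, 64, 64))            # synthetic stand-ins for real windows
labels = rng.integers(0, 2, 200)

X = np.array([hog_descriptor(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print("head probability of first window:", clf.predict_proba(X[:1])[0, 1])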
@mastersthesis{diva2:856339,
author = {Sandsveden, Daniel},
title = {{Evaluation of Random Forests for Detection and Localization of Cattle Eyes}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4885--SE}},
year = {2015},
address = {Sweden},
}
Anomaly detection is a general theory of detecting unusual patterns or events in data. This master thesis investigates the subject of anomaly detection in two different applications. The first application is product inspection using a camera and the second application is surveillance using a 2D laser scanner.
The first part of the thesis presents a system for automatic visual defect inspection. The system is based on aligning the images of the product to a common template and doing pixel-wise comparisons. The system is trained using only images of products that are defined as normal, i.e. non-defective products. The visual properties of the inspected products are modelled using three different methods. The performance of the system and the different methods have been evaluated on four different datasets.
The second part of the thesis presents a surveillance system based on a single laser range scanner. The system is able to detect certain anomalous events based on the time, position and velocities of individual objects in the scene. The practical usefulness of the system is made plausible by a qualitative evaluation using unlabelled data.
@mastersthesis{diva2:855502,
author = {Thulin, Peter},
title = {{Anomaly Detection for Product Inspection and Surveillance Applications}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4889--SE}},
year = {2015},
address = {Sweden},
}
Integrated camera systems for increasing safety and maneuverability are becoming increasingly common for heavy vehicles. One problem with heavy vehicles today is that there are blind spots where the driver has no or very little view. There is a great demand for increasing safety and helping the driver to get a better view of the surroundings. This can be achieved by a sophisticated camera system, using cameras with a wide field of view, that covers dangerous blind spots.
This master thesis aims to investigate and develop a prototype solution for a camera system consisting of two fisheye cameras. The solution covers both hardware choices and software development including camera calibration and image stitching. Two different fisheye camera calibration toolboxes are compared and their results discussed, with the aim of finding the most suitable one for this application. The results from the two toolboxes differ in performance, and the result from only one of the toolboxes is sufficient for image stitching.
@mastersthesis{diva2:854521,
author = {Söderroos, Anna},
title = {{Fisheye Camera Calibration and Image Stitching for Automotive Applications}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4887--SE}},
year = {2015},
address = {Sweden},
}
The Kinect v2 is an RGB-D sensor manufactured as a gesture interaction tool for the entertainment console XBOX One. In this thesis we will use it to perform 3D reconstruction and investigate its ability to measure depth. In order to sense both color and depth the Kinect v2 has two cameras: one RGB camera and one infrared camera used to produce depth and near infrared images. These cameras need to be calibrated if we want to use them for 3D reconstruction. We present a calibration procedure for simultaneously calibrating the cameras and extracting their relative pose. This enables us to construct colored meshes of the environment. When we know the camera parameters of the infrared camera, the depth images can be used to perform the Kinect fusion algorithm. This produces well-formed meshes of the environment by combining many depth frames taken from several camera poses. The Kinect v2 uses a time-of-flight technology where the phase shifts are extracted from amplitude modulated infrared light signals produced by an emitter. The extracted phase shifts are then converted to depth values. However, the extraction of phase shifts includes a phase unwrapping procedure, which is sensitive to noise and can result in large depth errors. By utilizing the ability to access the raw phase measurements from the device we managed to modify the phase unwrapping procedure. This new procedure includes an extraction of several hypotheses for the unwrapped phase and a spatial propagation to select amongst them. The proposed method has been compared with the available drivers in the open source library libfreenect2 and the Microsoft Kinect SDK v2. Our experiments show that the depth images of the two available drivers have similar quality and that our proposed method improves over libfreenect2. The calculations in the proposed method are more expensive than those in libfreenect2, but it still runs at 2.5× real time. However, contrary to libfreenect2, the proposed method lacks a filter that removes outliers from the depth images. This turned out to be an important feature when performing Kinect fusion, and future work should thus be focused on adding an outlier filter.
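For intuition about the phase-to-depth relation underlying the unwrapping problem, a small sketch under simplifying assumptions (a single modulation frequency and an externally chosen wrap count; the actual Kinect v2 combines several modulation frequencies and calibration terms):

import numpy as np

C = 299_792_458.0                      # speed of light [m/s]

def phase_to_depth(phase, f_mod, n_wraps=0):
    """Convert a (wrapped) phase measurement to a distance for one modulation frequency.

    phase   : measured phase in [0, 2*pi)
    f_mod   : modulation frequency in Hz
    n_wraps : integer number of full 2*pi wraps chosen by the unwrapping procedure
    """
    unambiguous_range = C / (2.0 * f_mod)           # one wrap corresponds to this distance
    return (phase + 2.0 * np.pi * n_wraps) / (2.0 * np.pi) * unambiguous_range

# Example: a phase of pi at 16 MHz with no wraps is half of the ~9.4 m unambiguous range.
print(phase_to_depth(np.pi, 16e6, n_wraps=0))       # approx. 4.68 m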
@mastersthesis{diva2:854680,
author = {Järemo Lawin, Felix},
title = {{Depth Data Processing and 3D Reconstruction Using the Kinect v2}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4884--SE}},
year = {2015},
address = {Sweden},
}
Pedestrian detection is an important field with applications in active safety systems for cars as well as autonomous driving. Since autonomous driving and active safety are becoming technically feasible now, the interest for these applications has dramatically increased. The aim of this thesis is to investigate convolutional neural networks (CNN) for pedestrian detection. The reason for this is that CNN have recently been successfully applied to several different computer vision problems. The main applications of pedestrian detection are in real time systems. For this reason, this thesis investigates strategies for reducing the computational complexity of forward propagation for CNN. The approach used in this thesis for extracting pedestrians is to use a CNN to find a probability map of where pedestrians are located. From this probability map bounding boxes for pedestrians are generated. A method for handling scale invariance for the objects of interest has also been developed in this thesis. Experiments show that using this method gives significantly better results for the problem of pedestrian detection. The accuracy which this thesis has managed to achieve is similar to the accuracy of some other works which use CNN.
@mastersthesis{diva2:839692,
author = {Molin, David},
title = {{Pedestrian Detection Using Convolutional Neural Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4855--SE}},
year = {2015},
address = {Sweden},
}
The poaching of rhinoceros has increased dramatically in the last few years and the park rangers are often helpless against the militarised poachers. Linköping University is running several projects with the goal of aiding the park rangers in their work. This master thesis was produced at CybAero AB, which builds Remotely Piloted Aircraft Systems (RPAS). With their helicopters, high end cameras with a range sufficient to cover the whole area can be flown over the parks. The aim of this thesis is to investigate different methods to automatically find rhinos and humans, using airborne cameras. The system uses two cameras, one colour camera and one thermal camera. The latter is used to find interesting objects which are then extracted in the colour image. The object is then classified as either rhino, human or other. Several methods for classification have been evaluated. The results show that classifying solely on the thermal image gives nearly as high accuracy as classifying in combination with the colour image. This enables the system to be used at dusk and dawn or in bad light conditions. This is an important factor since most poaching occurs at dusk or dawn. As a conclusion, a system capable of running on low performance hardware and placeable on board the aircraft is presented.
@mastersthesis{diva2:843745,
author = {Karlsson Schmidt, Carl},
title = {{Rhino and Human Detection in Overlapping RGB and LWIR Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4837--SE}},
year = {2015},
address = {Sweden},
}
In this thesis a system for creating panoramic video has been developed. The panoramic video is formed by stitching several camera streams together. The system is designed as a vehicle mounted system, but can be applied to several other areas, such as surveillance. The system creates the video by finding features that correspond in the overlapping frames. By using cylindrical projection the problem is reduced to finding a translation between the images, and using algorithms such as ORB, matching features can be detected and described. The camera frames are stitched together by calculating the average translation of the matching features. To reduce artifacts such as ghosting, a simple but effective alpha blending technique has been used. The system has been implemented using C++ and the OpenCV library and the algorithm is capable of processing about 15 frames per second, making it close to real-time. With future improvements, such as parallel processing of the cameras, the system may be sped up even further and possibly include other types of image processing, e.g. object recognition and tracking.
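The thesis system was implemented in C++ with OpenCV; purely as an illustration of the ORB matching and average-translation idea, an equivalent sketch using the OpenCV Python bindings could look as follows (function name and parameters are illustrative, and cylindrical projection is assumed to have been applied beforehand):

import numpy as np
import cv2

def average_translation(img_left, img_right, max_matches=100):
    """Estimate the (dx, dy) shift between two overlapping frames from ORB matches."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img_left, None)
    kp2, des2 = orb.detectAndCompute(img_right, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:max_matches]
    shifts = np.array([np.array(kp2[m.trainIdx].pt) - np.array(kp1[m.queryIdx].pt)
                       for m in matches])
    return shifts.mean(axis=0)      # average translation used to place the right frame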
@mastersthesis{diva2:822602,
author = {Rydholm, Niklas},
title = {{Panoramic Video Stitching}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4858--SE}},
year = {2015},
address = {Sweden},
}
Machine learning can be utilized in many different ways in the field of automatic manufacturing and logistics. In this thesis supervised machine learning has been utilized to train classifiers for detection and recognition of objects in images. The techniques AdaBoost and Random forest have been examined, both of which are based on decision trees.
The thesis has considered two applications: barcode detection and optical character recognition (OCR). Supervised machine learning methods are highly appropriate in both applications since both barcodes and printed characters generally are rather distinguishable.
The first part of this thesis examines the use of machine learning for barcode detection in images, both traditional 1D-barcodes and the more recent Maxi-codes, which is a type of two-dimensional barcode. In this part the focus has been to train classifiers with the technique AdaBoost. The Maxi-code detection is mainly done with Local binary pattern features. For detection of 1D-codes, features are calculated from the structure tensor. The classifiers have been evaluated with around 200 real test images containing barcodes, and show promising results.
The second part of the thesis involves optical character recognition. The focus in this part has been to train a Random forest classifier using point pair features. The performance has also been compared with the more proven and widely used Haar features. The results show that Haar features are superior in terms of accuracy; nevertheless, the conclusion is that point pairs can be utilized as features for Random forest in OCR.
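As an illustrative sketch of structure tensor features of the kind mentioned for 1D-barcode detection (not the thesis feature set), the per-pixel tensor and its coherence measure can be computed with NumPy and SciPy:

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def structure_tensor_features(gray, sigma=2.0):
    """Per-pixel structure-tensor features; 1D barcodes give one dominant gradient direction."""
    gx = sobel(gray.astype(float), axis=1)
    gy = sobel(gray.astype(float), axis=0)
    Jxx = gaussian_filter(gx * gx, sigma)
    Jxy = gaussian_filter(gx * gy, sigma)
    Jyy = gaussian_filter(gy * gy, sigma)
    trace = Jxx + Jyy
    det = Jxx * Jyy - Jxy ** 2
    # Eigenvalues of the 2x2 tensor; their ratio separates line-like from isotropic texture.
    tmp = np.sqrt(np.maximum((trace / 2) ** 2 - det, 0))
    lam1, lam2 = trace / 2 + tmp, trace / 2 - tmp
    coherence = (lam1 - lam2) / (lam1 + lam2 + 1e-12)   # close to 1 inside a 1D barcode
    return np.dstack([trace, coherence])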
@mastersthesis{diva2:822575,
author = {Fridolfsson, Olle},
title = {{Machine Learning:
for Barcode Detection and OCR}},
school = {Linköping University},
type = {{LiTH-ISY-Ex--15/4842--SE}},
year = {2015},
address = {Sweden},
}
In a synchronized multi camera system it is imperative that the synchronization error between the different cameras is as close to zero as possible and that the jitter of the presumed frame rate is as small as possible. It is even more important when these systems are used in an autonomous vehicle trying to sense its surroundings. We would never hand over the control to an autonomous vehicle if we couldn't trust the data it is using for moving around.
The purpose of this thesis was to build a synchronization setup for a multi camera system using state-of-the-art RayTrix digital cameras that will be used in the iQMatic project involving autonomous heavy duty vehicles. The iQMatic project is a collaboration between several Swedish industrial partners and universities. Software development for the multi camera system was also involved. Different synchronization techniques were implemented and then analysed against the system requirements. The two techniques were a hardware trigger, i.e. an external trigger using a microcontroller, and a software trigger using the API of the digital cameras.
Experiments were conducted by testing the different trigger modes with the developed multi camera software. The conclusions show that the hardware trigger is preferable in this particular system, showing more stability and better statistics against the system requirements than the software trigger. However, the thesis also shows that additional experiments are needed for a more accurate analysis.
@mastersthesis{diva2:822340,
author = {Vibeck, Alexander},
title = {{Synchronization of a Multi Camera System}},
school = {Linköping University},
type = {{LiTH-ISY-EX-ET--15/0438--SE}},
year = {2015},
address = {Sweden},
}
In the field of industrial automation large savings can be realized if the position and orientation of an object is known. Knowledge about an object's position and orientation can be used by advanced robotic systems to be able to work with complex items. Specifically, 2D-objects are a big enough sub domain to motivate special attention. Traditionally this problem has been solved with large mechanical systems that force the objects into specific configurations. Besides being expensive, taking up a lot of space and having great difficulty handling fragile items, these mechanical systems have to be constructed for each particular type of object. This thesis explores the possibility of using registration algorithms from computer vision based on 3D-data to find flat objects. While systems for locating 3D objects already exist, they have issues with locating essentially flat objects since their positioning is mostly a function of their contour. The thesis consists of a brief examination of 2D-algorithms and their extension to 3D as well as results from the most suitable algorithm.
@mastersthesis{diva2:821158,
author = {Ingberg, Benjamin},
title = {{Registration of 2D Objects in 3D data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--15/4848--SE}},
year = {2015},
address = {Sweden},
}
Autonomous driving, or self driving vehicles, are concepts of vehicles knowing their environment and making driving manoeuvres without instructions from a driver. The concepts have been around for decades but have improved significantly in recent years as research in this area has made substantial progress. Benefits of autonomous driving include the possibility to decrease the number of accidents in traffic and thereby save lives.
A major challenge in autonomous driving is to acquire 3D information and relations between all objects in surrounding traffic. This is referred to as spatial perception. Stereo camera systems have become a central sensor module for advanced driver assistance systems and autonomous driving. For object detection and measurements at large distances, stereo vision encounters difficulties. These include objects being small, having low contrast and the presence of image noise. Having an accurate perception of the environment at large distances is however of high interest for many applications, especially autonomous driving.
This thesis proposes a method which tries to increase the range to where generic objects are first detected using a given stereo camera setup. Objects are represented by planes in 3D space. The input image is segmented into the various objects and the 3D plane parameters are estimated jointly. The 3D plane parameters are estimated directly from the stereo image pairs. In particular, this thesis investigates methods to introduce geometric constraints to the segmentation or labeling task, i.e assigning each considered pixel in the image to a plane.
The methods provided in this thesis show that despite the difficulties at large distances it is possible to exploit planar primitives in 3D space for obstacle detection at distances where other methods fail.
@mastersthesis{diva2:778457,
author = {Hillgren, Patrik},
title = {{Geometric Scene Labeling for Long-Range Obstacle Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4819--SE}},
year = {2015},
address = {Sweden},
}
A classic computer vision task is the estimation of a 3D map from a collection of images. This thesis explores the online simultaneous estimation of camera poses and map points, often called visual simultaneous localisation and mapping (VSLAM). In the near future the use of visual information by autonomous cars is likely, since driving is a vision dominated process. For example, VSLAM could be used to estimate the position of the car in relation to objects of interest, such as the road, other cars and pedestrians. Aimed at the creation of a real-time, robust, loop closing, single camera SLAM system, the properties of several state-of-the-art VSLAM systems and related techniques are studied. The system goals cover several important, if difficult, problems, which makes a solution widely applicable. This thesis makes two contributions: a rigorous qualitative analysis of VSLAM methods and a system designed accordingly. A novel tracking by matching scheme is proposed, which, unlike the trackers used by many similar systems, is able to deal better with forward camera motion. The system estimates general motion with loop closure in real time. The system is compared to a state-of-the-art monocular VSLAM algorithm and found to be similar in speed and performance.
@mastersthesis{diva2:771912,
author = {Persson, Mikael},
title = {{Online Monocular SLAM:
Rittums}},
school = {Linköping University},
type = {{Lith-ISY-EX--13/4741-SE}},
year = {2014},
address = {Sweden},
}
In this thesis we study the problem of multi-session dense RGB-D SLAM for 3D reconstruction. Multi-session reconstruction can allow users to capture parts of an object that could not easily be captured in one session, due for instance to poor accessibility or user mistakes. We first present a thorough overview of single-session dense RGB-D SLAM and describe the multi-session problem as a loosening of the incremental camera movement and static scene assumptions commonly held in the single-session case. We then implement and evaluate several variations on a system for doing two-session reconstruction as an extension to a single-session dense RGB-D SLAM system.
The extension from one to several sessions is divided into registering separate sessions into a single reference frame, re-optimizing the camera trajectories, and fusing together the data to generate a final 3D model. Registration is done by matching reconstructed models from the separate sessions using one of two adaptations of a 3D object detection pipeline. The registration pipelines are evaluated with many different sub-steps on a challenging dataset and it is found that robust registration can be achieved using the proposed methods on scenes without degenerate shape symmetry. In particular we find that using plane matches between two sessions as constraints for as much as possible of the registration pipeline improves results.
Several different strategies for re-optimizing camera trajectories using data from both sessions are implemented and evaluated. The re-optimization strategies are based on re-tracking the camera poses from all sessions together, and then optionally optimizing over the full problem as represented on a pose-graph. The camera tracking is done by incrementally building and tracking against a TSDF volume, from which a final 3D mesh model is extracted. The whole system is qualitatively evaluated against a realistic dataset for multi-session reconstruction. It is concluded that the overall approach is successful in reconstructing objects from several sessions, but that other fine grained registration methods would be required in order to achieve multi-session reconstructions that are indistinguishable from single-session results in terms of reconstruction quality.
@mastersthesis{diva2:772448,
author = {Widebäck West, Nikolaus},
title = {{Multiple Session 3D Reconstruction using RGB-D Cameras}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4814--SE}},
year = {2014},
address = {Sweden},
}
The interest in using GPUs as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPUs powerful enough to outperform CPUs by an order of magnitude for suitable algorithms. For embedded systems, GPUs are not as popular yet. The embedded GPUs available on the market have often not been able to justify hardware changes from the current systems (CPUs and FPGAs) to systems using embedded GPUs. They have been too hard to get, too energy consuming and not suitable for some algorithms. At SICK IVP, advanced computer vision algorithms run on FPGAs. This master thesis optimizes two such algorithms for embedded GPUs and evaluates the result. It also evaluates the status of the embedded GPUs on the market today. The results indicate that embedded GPUs perform well enough to run the evaluated algorithms as fast as needed. The implementations are also easy to understand compared to implementations for the competing FPGA hardware.
@mastersthesis{diva2:768419,
author = {Nilsson, Mattias},
title = {{Evaluation of Computer Vision Algorithms Optimized for Embedded GPU:s.}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4816--SE}},
year = {2014},
address = {Sweden},
}
Visual simultaneous localization and mapping (SLAM) as a field has been researched for ten years, but with recent advances in mobile performance visual SLAM is entering the consumer market in a completely new way. A visual SLAM system will however be sensitive to incautious use that may result in severe motion, occlusion or poor surroundings in terms of visual features, which will cause the system to temporarily fail. The procedure of recovering from such a failure is called relocalization. Together with two similar problems, localization (finding your position in an existing SLAM session) and loop closing (the online repair and refinement of the map in an active SLAM session), these can be grouped as visual location recognition (VLR).
This thesis presents novel results by combining the scalability of FabMap and the precision of 13th Lab's tracking, yielding high-precision VLR (+/- 10 cm) while maintaining above 99 % precision and 60 % recall for sessions containing thousands of images, all running entirely on a normal mobile phone.
The applications of VLR are many. Indoors, where GPS is not functioning, VLR can still provide positional information and navigate you through big complexes like airports and museums. Outdoors, VLR can improve the precision of GPS tenfold yielding a new level of navigational experience. Virtual and augmented reality applications are other areas that benefit from improved positioning and localization.
@mastersthesis{diva2:767444,
author = {Sjöholm, Alexander},
title = {{Closing the Loop:
Mobile Visual Location Recognition}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4813--SE}},
year = {2014},
address = {Sweden},
}
The usage of 3D-modeling is expanding rapidly. Modeling from aerial imagery has become very popular due to its increasing number of both civilian and military applications like urban planning, navigation and target acquisition.
This master thesis project was carried out at Vricon Systems at SAAB. The Vricon system produces high resolution geospatial 3D data based on aerial imagery from manned aircraft, unmanned aerial vehicles (UAV) and satellites.
The aim of this work was to investigate to what degree superpixel segmentation and supervised learning can be applied to a terrain classification problem using imagery and digital surface models (DSM). The aim was also to investigate how the height information from the digital surface model may contribute compared to the information from the grayscale values. The goal was to identify buildings, trees and ground. Another task was to evaluate existing methods, and compare results.
The approach for solving the stated goal was divided into several parts. The first part was to segment the image using superpixel segmentation, after that features were extracted. Then the classifiers were created and trained and finally the classifiers were evaluated.
The classification method that obtained the best results in this thesis had approximately 90 % correctly labeled superpixels. The result was equal, if not better, compared to other solutions available on the market.
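A hedged sketch of the superpixel-plus-classifier idea described above, using SLIC from scikit-image and simple per-segment means of intensity and DSM height as features; the feature choice and parameters are illustrative, not those of the thesis:

import numpy as np
from skimage.segmentation import slic
from sklearn.ensemble import RandomForestClassifier

def superpixel_features(gray, dsm, n_segments=500):
    """Segment into superpixels and compute per-segment intensity and height statistics."""
    # channel_axis=None marks the input as single-channel (scikit-image >= 0.19).
    labels = slic(gray, n_segments=n_segments, compactness=0.1, channel_axis=None)
    feats = []
    for s in np.unique(labels):
        m = labels == s
        feats.append([gray[m].mean(), dsm[m].mean(), dsm[m].std()])
    return labels, np.array(feats)

# With per-superpixel ground truth (building / tree / ground), a classifier can then be trained:
# clf = RandomForestClassifier(n_estimators=200).fit(train_feats, train_classes)
# predicted = clf.predict(test_feats)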
@mastersthesis{diva2:767120,
author = {Ringqvist, Sanna},
title = {{Classification of terrain using superpixel segmentation and supervised learning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4752--SE}},
year = {2014},
address = {Sweden},
}
The Next-Best-View (NBV) problem plays an important part in automatic 3D object reconstruction and exploration applications. This thesis presents a novel approach to ray-casting in Occupancy Grid Maps (OGM) in the context of solving the NBV problem in a 3D exploration setting. The proposed approach utilizes the structure of an octree-based OGM to perform calculations of potential information gain. The computations are significantly faster than with current methods, without decreasing mapping quality. Performance, in terms of mapping quality, coverage and computational complexity, is experimentally verified through a comparison with existing state-of-the-art methods using high-resolution point cloud data generated by time-of-flight laser range scanners.
Current methods for viewpoint ranking focus either heavily on mapping performance or on computation speed. The results presented in this thesis indicate that the proposed method is able to achieve a mapping performance similar to that of the performance-oriented approaches while keeping computation times as low as those of more approximative methods.
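The sketch below illustrates the general idea of scoring candidate viewpoints by ray-cast information gain. It is a simplified, hypothetical example: it uses a dense numpy occupancy grid rather than the octree-based OGM the thesis works with, and approximates information gain by counting unknown voxels along each ray.

```python
# Simplified sketch of ray-cast information gain for Next-Best-View scoring.
# A dense numpy grid stands in for the octree OGM, and "information gain" is
# approximated by the number of unknown voxels a ray traverses before hitting
# an occupied one.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def ray_gain(grid, origin, direction, max_range, step=0.5):
    gain = 0
    pos = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    for _ in range(int(max_range / step)):
        pos = pos + d * step
        idx = tuple(pos.astype(int))
        if not all(0 <= idx[i] < grid.shape[i] for i in range(3)):
            break                       # ray left the map
        if grid[idx] == OCCUPIED:
            break                       # ray is blocked
        if grid[idx] == UNKNOWN:
            gain += 1                   # this voxel would be observed
    return gain

def view_gain(grid, origin, directions, max_range=20.0):
    """Score a candidate viewpoint by summing gain over its sensor rays."""
    return sum(ray_gain(grid, origin, d, max_range) for d in directions)

grid = np.full((32, 32, 32), UNKNOWN, dtype=int)
grid[:, :, :8] = FREE                               # an already explored slab
rays = [np.array([np.cos(a), np.sin(a), 0.3]) for a in np.linspace(0, 2 * np.pi, 36)]
print("candidate score:", view_gain(grid, (16.0, 16.0, 4.0), rays))
```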
@mastersthesis{diva2:761834,
author = {Svensson, Martin},
title = {{Accelerated Volumetric Next-Best-View Planning in 3D Mapping}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4801--SE}},
year = {2014},
address = {Sweden},
}
Many methods have been developed for visual tracking of generic objects. The vast majority of these assume the world is two-dimensional, either ignoring the third dimension or only dealing with it indirectly. This causes difficulties for the tracker when the target approaches or moves away from the camera, is occluded or moves out of the camera frame.
Unmanned aerial vehicles (UAVs) are increasingly used in civilian applications and some of these will undoubtedly carry tracking systems in the future. As they move around, these trackers will encounter both scale changes and occlusions. To improve the tracking performance in these cases, the third dimension should be taken into account.
This thesis extends the capabilities of a 2D tracker to three dimensions, with the assumption that the target moves on a ground plane.
The position of the tracker camera is established by matching the video it produces to a sparse point-cloud map built with off-the-shelf structure-from-motion software. A target is tracked with a generic 2D tracker and subsequently positioned on the ground. Should the target disappear from view, its motion on the ground is predicted. In combination, these simple techniques are shown to improve the robustness of a tracking system on a moving platform under target scale changes and occlusions.
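As a small illustration of how a 2D track can be positioned on a ground plane, the sketch below intersects a pixel's viewing ray with the plane z = 0, assuming known camera intrinsics and a camera pose obtained, for example, from the structure-from-motion map. The intrinsics and pose values are placeholders, not those used in the thesis.

```python
# Hedged sketch of placing a 2D-tracked target on the ground plane (z = 0),
# assuming known intrinsics K, camera-to-world rotation R_wc and camera
# center C from the SfM map. All values below are placeholders.
import numpy as np

def pixel_to_ground(u, v, K, R_wc, C):
    """Intersect the viewing ray of pixel (u, v) with the ground plane z = 0."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera coords
    ray_world = R_wc @ ray_cam                            # rotate into world frame
    s = -C[2] / ray_world[2]                              # scale factor reaching z = 0
    return C + s * ray_world

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
R_wc = np.diag([1.0, -1.0, -1.0])      # camera looking straight down (placeholder)
C = np.array([0.0, 0.0, 5.0])          # camera 5 m above the ground
print(pixel_to_ground(400, 300, K, R_wc, C))   # ground point under that pixel
```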
@mastersthesis{diva2:761603,
author = {Robinson, Andreas},
title = {{Implementation and evaluation of a 3D tracker}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4800--SE}},
year = {2014},
address = {Sweden},
}
A lane position system, and enhancement techniques for increasing its robustness and availability, are investigated. The enhancements are performed by using additional sensor sources such as map data and GPS. The thesis contains a description of the system, two models of the system and two implemented filters for the system. The thesis also contains conclusions and results of theoretical and experimental tests of the increased robustness and availability of the system. The system can be integrated with an existing system that investigates driver behavior, developed for fatigue detection. That system was developed in a project named Drowsi, in which among others Volvo Technology participated.
@mastersthesis{diva2:749036,
author = {Landberg, Markus},
title = {{Enhancement Techniques for Lane Position Adaptation (Estimation) using GPS- and Map Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4788--SE}},
year = {2014},
address = {Sweden},
}
High-resolution 3D images are of high interest in military operations, where data can be used to classify and identify targets. The Swedish Defence Research Agency (FOI) is interested in the latest research and technologies in this area. A drawback with normal 3D laser systems is the lack of high resolution for long-range measurements. One technique for high long-range resolution laser radar is based on time-correlated single photon counting (TCSPC). By repetitively sending out short laser pulses and measuring the time of flight (TOF) of single reflected photons, extremely accurate range measurements can be made. A drawback with this method is that it is hard to create single-photon detectors with many pixels and high temporal resolution; hence a single detector is used. Scanning an entire scene with one detector is very time-consuming, and instead, as this thesis explores, the entire scene can be measured with fewer measurements than the number of pixels. To do this, a technique called compressed sensing (CS) is introduced. CS exploits the fact that signals normally are compressible and can be represented sparsely in some basis. CS sets other requirements on the sampling than the normal Shannon-Nyquist sampling theorem. With a digital micromirror device (DMD), linear combinations of the scene can be reflected onto the single-photon detector, creating scalar intensity values as measurements. This means that fewer DMD patterns than the number of pixels can reconstruct the entire 3D scene. In this thesis a computer model of the laser system helps to evaluate different CS reconstruction methods with different scenarios of the laser system and the scene. The results show how many measurements are required to reconstruct scenes properly and how the DMD patterns affect the results. CS proves to enable a great reduction, 85-95 %, of the required measurements compared to a pixel-by-pixel scanning system. Total variation minimization proves to be the best choice of reconstruction method.
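The sketch below only illustrates the core compressed-sensing idea of recovering a signal from far fewer measurements than unknowns. It is a hedged, simplified example: a 1-D sparse signal is recovered with l1-regularized ISTA, whereas the thesis reconstructs 3D scenes and finds total variation minimization to work best; the rows of the random matrix play the role of DMD patterns.

```python
# Simplified compressed-sensing sketch: recover a sparse signal from far fewer
# measurements than unknowns using ISTA (l1 regularization). This is not the
# TV-minimization reconstruction the thesis favours; it only shows the
# M << N recovery principle.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 40, 5                        # unknowns, measurements, nonzeros
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.normal(size=K)

A = rng.normal(size=(M, N)) / np.sqrt(M)    # random "patterns" (placeholder)
y = A @ x_true                               # scalar detector measurements

def ista(A, y, lam=0.01, iters=500):
    x = np.zeros(A.shape[1])
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    for _ in range(iters):
        g = A.T @ (A @ x - y)                # gradient of 0.5 * ||Ax - y||^2
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

x_hat = ista(A, y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```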
@mastersthesis{diva2:722826,
author = {Fall, Erik},
title = {{Compressed Sensing for 3D Laser Radar}},
school = {Linköping University},
type = {{LiTH-ISY-EX---14/4767---SE}},
year = {2014},
address = {Sweden},
}
Computer vision is a rapidly growing, interdisciplinary research field whose applications are taking an ever more prominent role in today's society. With the increased interest in computer vision comes an increased need to control the cameras connected to computer vision systems.
At Linköping University's Institute of Technology, the Computer Vision Laboratory has developed the framework EDSDK++ for remote control of digital cameras manufactured by Canon Inc. The framework is very extensive and contains a large number of functions and settings, and the system is therefore still largely untested. This thesis aims to develop a demonstrator system for EDSDK++ in the form of a simple active vision system that uses real-time face detection to steer a camera tilt unit, and a camera mounted on the tilt, to follow, zoom in on and focus on a face or a group of faces. One requirement was that the OpenCV library should be used for the face detection and that EDSDK++ should be used to control the camera. In addition, an API for controlling the camera tilt unit was to be developed.
During the development work, different methods for face detection were investigated, among other things. To improve performance, multiple face detectors were used, scanning an image in parallel from different angles with the help of multithreading. Both experimental and theoretical approaches were taken to determine the parameters needed to control the camera and the camera tilt unit. The result of the work is a demonstrator that fulfilled all requirements.
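As a minimal illustration of the face-detection part, the sketch below runs an OpenCV Haar-cascade detector on webcam frames. It is a simplified stand-in: the thesis uses multiple detectors scanning the image in parallel from different angles, and converts detections into pan/tilt and zoom commands via EDSDK++ and the tilt API, none of which is shown here.

```python
# Minimal face-detection loop with OpenCV, as a simplified stand-in for the
# multi-threaded, multi-angle detector described above. Control of the Canon
# camera through EDSDK++ and of the tilt unit is omitted.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)                      # any webcam stands in for the rig

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # A real system would convert the face centroid into pan/tilt commands here.
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```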
@mastersthesis{diva2:722871,
author = {Karg\'{e}n, Rolf},
title = {{Utveckling av ett active vision system för demonstration av EDSDK++ i tillämpningar inom datorseende}},
school = {Linköping University},
type = {{LiTH-ISY-EX-ET--14/0419--SE}},
year = {2014},
address = {Sweden},
}
In recent years several depth cameras have emerged on the consumer market, creating many interesting possibilities for both professional and recreational usage. One example of such a camera is the Microsoft Kinect sensor, originally used with the Microsoft Xbox 360 game console. In this master thesis a system is presented that utilizes this device in order to create as accurate a 3D reconstruction of an indoor environment as possible. The major novelty of the presented system is the data structure, based on signed distance fields and voxel octrees, used to represent the observed environment.
@mastersthesis{diva2:716061,
author = {Bengtsson, Morgan},
title = {{Indoor 3D Mapping using Kinect}},
school = {Linköping University},
type = {{LiTH-ISY-EX--14/4753--SE}},
year = {2014},
address = {Sweden},
}
Visual tracking is a classical computer vision problem with many important applications in areas such as robotics, surveillance and driver assistance. The task is to follow a target in an image sequence. The target can be any object of interest, for example a human, a car or a football. Humans perform accurate visual tracking with little effort, while it remains a difficult computer vision problem. It imposes major challenges, such as appearance changes, occlusions and background clutter. Visual tracking is thus an open research topic, but significant progress has been made in the last few years.
The first part of this thesis explores generic tracking, where nothing is known about the target except for its initial location in the sequence. A specific family of generic trackers that exploit the FFT for faster tracking-by-detection is studied. Among these, the CSK tracker has recently been shown to obtain competitive performance at extraordinarily low computational cost. Three contributions are made to this type of tracker. Firstly, a new method for learning the target appearance is proposed and shown to outperform the original method. Secondly, different color descriptors are investigated for the tracking purpose. Evaluations show that the best descriptor greatly improves the tracking performance. Thirdly, an adaptive dimensionality reduction technique is proposed, which adaptively chooses the most important feature combinations to use. This technique significantly reduces the computational cost of the tracking task. Extensive evaluations show that the proposed tracker outperforms state-of-the-art methods in the literature, while operating at a several times higher frame rate.
In the second part of this thesis, the proposed generic tracking method is applied to human tracking in surveillance applications. A causal framework is constructed, that automatically detects and tracks humans in the scene. The system fuses information from generic tracking and state-of-the-art object detection in a Bayesian filtering framework. In addition, the system incorporates the identification and tracking of specific human parts to achieve better robustness and performance. Tracking results are demonstrated on a real-world benchmark sequence.
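The sketch below illustrates the FFT-based tracking-by-detection idea behind this family of trackers: a correlation filter is learned in the Fourier domain from a template patch and a desired Gaussian response, and the target is then located as the peak of the filter response on a new patch. It is a hedged, MOSSE-flavoured simplification rather than the CSK formulation or the methods proposed in the thesis; windowing, feature extraction and online updates are omitted.

```python
# Hedged sketch of FFT-based tracking-by-detection (MOSSE-flavoured):
# learn a correlation filter in the Fourier domain from one patch and a
# Gaussian desired response, then locate the target as the response peak
# on a new patch. Not the thesis' actual tracker.
import numpy as np

def gaussian_response(shape, sigma=2.0):
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(np.roll(g, -h // 2, axis=0), -w // 2, axis=1)   # peak at (0, 0)

def train_filter(patch, lam=1e-2):
    G = np.fft.fft2(gaussian_response(patch.shape))
    F = np.fft.fft2(patch)
    return G * np.conj(F) / (F * np.conj(F) + lam)   # filter in the Fourier domain

def detect(H, patch):
    response = np.real(np.fft.ifft2(H * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return dy, dx

template = np.random.rand(64, 64)                    # placeholder appearance patch
H = train_filter(template)
shifted = np.roll(template, (3, 5), axis=(0, 1))     # target moved by (3, 5) pixels
print(detect(H, shifted))                            # should recover the shift (3, 5)
```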
@mastersthesis{diva2:709327,
author = {Danelljan, Martin},
title = {{Visual Tracking}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4736--SE}},
year = {2013},
address = {Sweden},
}
This thesis has investigated whether it is possible to compare photographs of the seabed, taken with a camera mounted on SAAB Dynamics' vehicle AUV-62 (here called Sapphires), with SONAR images taken from the same vehicle on another occasion. Items imaged with a camera and with side-looking SONARs do not normally share visual appearance and are therefore hard to compare. For this reason, the method chosen for comparing the camera and SONAR images is not based on the individual appearance of items but on patterns created by several of them. Items in the images are identified as objects, each described by a position in longitude and latitude and a radius. In the camera images, objects are identified by segmenting the images with MSER, where stones and other items have an appearance deviating from the sandy background. In the SONAR image, regions containing objects are identified by studying high echo intensities, corresponding to items that reflect the sound pulses well, and objects are then created by applying MSER to these regions. The two sets of objects, from the camera and SONAR images, are then compared by matching every object in the camera image against every object in the SONAR image: translating under the hypothesis that they are the same object and counting how many of the remaining objects fit that assumption.
@mastersthesis{diva2:680896,
author = {Ekblad, Richard},
title = {{Korrelering mellan optiskt och akustiskt avbildade objekt på havsbotten}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4742--SE}},
year = {2013},
address = {Sweden},
}
Recording a video sequence with a camera during movement often produces blurred results. This is mainly due to motion blur which is caused by rapid movement of objects in the scene or the camera during recording. By correcting for changes in the orientation of the camera, caused by e.g. uneven terrain, it is possible to minimize the motion blur and thus, produce a stabilized video.
In order to do this, data gathered from a gyroscope and the camera itself can be used to measure the orientation of the camera. The raw data needs to be processed, synchronized and filtered to produce a robust estimate of the orientation. This estimate can then be used as input to an automatic control system in order to correct for changes in the orientation.
This thesis focuses on examining the possibility of such a stabilization. The actual stabilization is left for future work. An evaluation of the hardware as well as the implemented methods is done with emphasis on speed, which is crucial in real-time computing.
@mastersthesis{diva2:656064,
author = {Gratorp, Eric},
title = {{Evaluation of online hardware video stabilization on a moving platform}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4723--SE}},
year = {2013},
address = {Sweden},
}
In most cases today, when a specific person's whereabouts are monitored through video surveillance, it is done manually, and his or her location when not seen is based on assumptions about how fast he or she can move. Since humans are good at recognizing people this can be done accurately, given good video data, but the time needed to go through all the data is extensive and therefore expensive. Because of rapid technical development, computers are getting cheaper to use and therefore more attractive for tedious work.
This thesis is part of a larger project that aims to see to what extent it is possible to estimate a person of interest's time-dependent 3D position when seen in surveillance videos. The surveillance videos are recorded with non-overlapping monocular cameras. Furthermore, the project aims to see whether the person of interest's movement, when position data is unavailable, can be predicted. The outcome of the project is software capable of following a person of interest's movement with an error estimate visualized as an area indicating where the person of interest might be at a specific time.
The main focus of this thesis is to implement and evaluate a people detector meant to be used in the project, reduce noise in the position measurements, predict the position when the person of interest's location is unknown, and evaluate the complete project.
The project combines known methods in computer vision and signal processing, and the outcome is software that can be used on a normal PC running a Windows operating system. The software implemented in the thesis uses a Hough-transform-based people detector and a Kalman filter for one-step-ahead prediction. The detector is evaluated with known methods such as miss rate vs. false positives per window or image (FPPW and FPPI, respectively) and recall vs. 1-precision.
The results indicate that it is possible to estimate a person of interest's 3D position with single monocular cameras. It is also possible to follow the movement, to some extent, where position data is unavailable. However, the software needs more work in order to be robust enough to handle the diversity that may appear in different environments and to handle large-scale sensor networks.
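The sketch below shows a minimal constant-velocity Kalman filter of the kind described above, used both to smooth noisy position measurements and to predict one step ahead when no detection is available. All matrices and noise levels are placeholder assumptions, not the values used in the thesis.

```python
# Minimal constant-velocity Kalman filter as a hedged stand-in for the
# measurement smoothing and one-step-ahead prediction described above.
# State is [x, y, vx, vy]; the noise levels are placeholders.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
Hm = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01            # process noise (placeholder)
R = np.eye(2) * 0.5             # measurement noise (placeholder)

x, P = np.zeros(4), np.eye(4)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)
    x = x + K @ (z - Hm @ x)
    P = (np.eye(4) - K @ Hm) @ P
    return x, P

# Two detections followed by two frames without any detection.
for z in [np.array([0.0, 0.0]), np.array([1.1, 0.9]), None, None]:
    x, P = predict(x, P)            # one-step-ahead prediction
    if z is not None:               # detection available: correct the estimate
        x, P = update(x, P, z)
    print(x[:2])                    # estimated (possibly predicted) position
```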
@mastersthesis{diva2:652387,
author = {Markström, Johannes},
title = {{3D Position Estimation of a Person of Interest in Multiple Video Sequences:
People Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4721--SE}},
year = {2013},
address = {Sweden},
}
Because of the increase in the number of security cameras, there is more video footage available than a human could efficiently process. In combination with the fact that computers are getting more efficient, it is getting more and more interesting to solve the problem of detecting and recognizing people automatically.
Therefore a method is proposed for estimating the 3D path of a person of interest in multiple, non-overlapping, monocular cameras. This project is a collaboration between two master theses. This thesis focuses on recognizing a person of interest from several possible candidates, as well as estimating the 3D position of a person and providing a graphical user interface for the system. The recognition of the person of interest includes keeping track of said person frame by frame, and identifying said person in video sequences where the person of interest has not been seen before.
The final product is able to both detect and recognize people in video, as well as estimating their 3D-position relative to the camera. The product is modular and any part can be improved or changed completely, without changing the rest of the product. This results in a highly versatile product which can be tailored for any given situation.
@mastersthesis{diva2:650889,
author = {Johansson, Victor},
title = {{3D Position Estimation of a Person of Interest in Multiple Video Sequences:
Person of Interest Recognition}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4718--SE}},
year = {2013},
address = {Sweden},
}
A fully automatic de-weathering system to increase the visibility and stability in surveillance applications during bad weather has been developed. Rain, snow and haze during daylight are handled in real time with acceleration from CUDA-implemented algorithms. Video from fixed cameras is processed on a PC with no need for special hardware except an NVIDIA GPU. The system does not use any background model and does not require any precalibration. An increase in contrast is obtained in all haze/rain/snow cases, while the system lags by at most one frame during rain or snow removal. De-hazing can be obtained for any distance to simplify tracking or other algorithms operating on a surveillance system.
@mastersthesis{diva2:647937,
author = {Pettersson, Niklas},
title = {{GPU-Accelerated Real-Time Surveillance De-Weathering}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4677--SE}},
year = {2013},
address = {Sweden},
}
In Sweden and many other northern countries, it is common for heat to be distributed to homes and industries through district heating networks. Such networks consist of pipes buried underground carrying hot water or steam with temperatures in the range of 90-150 °C. Due to bad insulation or cracks, heat or water leakages might appear.
A system for large-scale monitoring of district heating networks through remote thermography has been developed and is in use at the company Termisk Systemteknik AB. Infrared images are captured from an aircraft and analysed, finding and indicating the areas where the ground temperature is higher than normal. During the analysis, however, many warm areas other than true water or energy leakages are marked as detections. Objects or phenomena that can cause false alarms are those that, for some reason, are warmer than their surroundings, for example chimneys, cars and heat leakages from buildings.
During the last couple of years, the system has been used in a number of cities. Therefore, there exists a fair number of examples of different types of detections. The purpose of the present master's thesis is to evaluate the reduction of false alarms of the existing analysis that can be achieved with the use of a learning system, i.e. a system which can learn how to recognize different types of detections.
A labelled data set for training and testing was acquired by contact with customers. Furthermore, a number of features describing the intensity difference within the detection, its shape and propagation as well as proximity information were found, implemented and evaluated. Finally, four different classifiers and other methods for classification were evaluated.
The method that obtained the best results consists of two steps. In the initial step, all detections which lie on top of a building are removed from the data set of labelled detections. The second step consists of classification using a Random forest classifier. Using this two-step method, the number of false alarms is reduced by 43% while the percentage of water and energy detections correctly classified is 99%.
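The sketch below mirrors the structure of the two-step method described above: detections lying on a building footprint are discarded first, and the remaining detections are classified with a random forest. The features, labels and building mask are synthetic placeholders, not the thesis' actual descriptors or data.

```python
# Hedged sketch of the two-step false-alarm reduction described above:
# (1) discard detections that lie on a building, (2) classify the remainder
# with a random forest. All data below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))                 # e.g. intensity diff., shape, proximity
on_building = rng.random(n) < 0.2           # from a building-footprint overlay
y = rng.integers(0, 2, n)                   # 1 = true leakage, 0 = false alarm

# Step 1: remove detections that lie on top of a building.
keep = ~on_building
X_kept, y_kept = X[keep], y[keep]

# Step 2: random forest classification of the remaining detections.
clf = RandomForestClassifier(n_estimators=200).fit(X_kept, y_kept)
print("kept detections:", keep.sum(), "of", n)
print("training accuracy:", clf.score(X_kept, y_kept))
```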
@mastersthesis{diva2:640093,
author = {Berg, Amanda},
title = {{Classification of leakage detections acquired by airborne thermography of district heating networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4678--SE}},
year = {2013},
address = {Sweden},
}
Identification of individuals has been solved with many different solutions around the world, either using biometric data or external means of verification such as id cards or RFID tags. The advantage of using biometric measurements is that they are directly tied to the individual and are usually unalterable. Acquiring dependable measurements is however challenging when the individuals are uncooperative. A dependable system should be able to deal with this and produce reliable identifications.
The system proposed in this thesis can autonomously classify uncooperative specimens from depth data. The data is acquired from a depth camera mounted in an uncontrolled environment, where it was allowed to record continuously for two weeks. This requires stable data extraction and normalization algorithms to produce good representations of the specimens. Robust descriptors can therefore be extracted from each sample of a specimen, and together with different classification algorithms the system can be trained or validated. Even with as many as 138 different classes, the system achieves high recognition rates. Inspired by the research field of face recognition, the best classification algorithm, the method of fisherfaces, was able to accurately recognize 99.6% of the validation samples, followed by two variations of the method of eigenfaces achieving recognition rates of 98.8% and 97.9%, respectively. These results affirm that the capabilities of the system are adequate for a commercial implementation.
@mastersthesis{diva2:635227,
author = {Björkeson, Felix},
title = {{Autonomous Morphometrics using Depth Cameras for Object Classification and Identification}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4680--SE}},
year = {2013},
address = {Sweden},
}
Modern cars are often equipped with a vision system that collects information about the car and its surroundings. Camera calibration is extremely important in order to maintain high accuracy in automotive safety applications. The cameras are calibrated offline in the factory; however, the mounting of the camera may change slowly over time. If the angles of the actual mounting of the camera are known, compensation for the angles can be done in software. Therefore, online calibration is desirable.
This master's thesis describes how to dynamically calibrate the roll angle. Two different methods have been implemented and compared. The first detects vertical edges in the image, such as houses and lamp posts. The second detects license plates on cars in front of the camera in order to calculate the roll angle.
The two methods are evaluated and the results are discussed. The results of the two methods vary considerably, and the method that turned out to give the best results was the one that detects vertical edges.
@mastersthesis{diva2:630415,
author = {de Laval, Astrid},
title = {{Online Calibration of Camera Roll Angle}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4688--SE}},
year = {2013},
address = {Sweden},
}
A laser triangulating camera system projects a laser line onto an object to create height curves on the object surface. By moving the object, height curves from different parts of the object can be observed and combined to produce a three-dimensional representation of the object. The calibration of such a camera system involves transforming the received data to obtain real-world measurements instead of pixel-based measurements.
The calibration method presented in this thesis focuses specifically on small fields of view. The goal is to provide an easy-to-use and robust calibration method that can complement already existing calibration methods. The tool should yield measurements in metric units that are as good as possible, while still keeping the complexity and production cost of the calibration object low. The implementation uses only data from the laser plane itself, making it usable also in environments where no external light exists.
The proposed implementation utilises a complete scan of a three-dimensional calibration object and returns a calibration for three dimensions. The results of the calibration have been evaluated against synthetic and real data.
@mastersthesis{diva2:630377,
author = {Rydström, Daniel},
title = {{Calibration of Laser Triangulating Cameras in Small Fields of View}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4669--SE}},
year = {2013},
address = {Sweden},
}
Automatic tracking of an object of interest in a video sequence is a task that has been much researched. Difficulties include varying scale of the object, rotation, and object appearance changing over time, all of which can lead to tracking failures. Tracking methods such as short-term tracking often fail if the object moves out of the camera's field of view or changes shape rapidly. Also, small inaccuracies in the tracking method can accumulate over time, which can lead to tracking drift. Long-term tracking is also problematic, partly due to updating and degradation of the object model, leading to incorrectly classified and tracked objects.
This master's thesis implements a long-term tracking framework called Tracking-Learning-Detection, which can learn and adapt, using so-called P/N-learning, to changing object appearance over time, thus making it more robust to tracking failures. The framework consists of three parts: a tracking module which follows the object from frame to frame, a learning module that learns new appearances of the object, and a detection module which can detect learned appearances of the object and correct the tracking module if necessary.
This tracking framework is evaluated on thermal infrared videos and the results are compared to the results obtained from videos captured within the visible spectrum. Several important differences between visual and thermal infrared tracking are presented, and the effect these have on the tracking performance is evaluated.
In conclusion, the results are analyzed to evaluate which differences matter the most and how they affect tracking, and a number of different ways to improve the tracking are proposed.
@mastersthesis{diva2:627964,
author = {Stigson, Magnus},
title = {{Object Tracking Using Tracking-Learning-Detection in Thermal Infrared Video}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4668--SE}},
year = {2013},
address = {Sweden},
}
In factory automation, cameras and image processing algorithms can be used to inspect objects. This can decrease the number of faulty objects that leave the factory and reduce the manual labour needed. A vision sensor is a system where the camera and image processing are delivered together and only need to be configured for the application at hand; thus, no programming knowledge is required of the customer. In this master's thesis, a way to make the configuration of a vision sensor even easier is developed and evaluated.
The idea is that the customer knows his or her product much better than he or she knows image processing. The customer could take images of positive and negative samples of the object that is to be inspected. The algorithm should then, given these images, configure the vision sensor automatically.
The algorithm that is developed to solve this problem is described step by step with examples to illustrate the problems that needed to be solved. Much of the focus is on how to compare two configurations to each other, in order to find the best one. The resulting configuration from the algorithm is then evaluated with respect to types of applications, computation time and representativeness of the input images.
@mastersthesis{diva2:624443,
author = {Ollesson, Niklas},
title = {{Automatic Configuration of Vision Sensor}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4666--SE}},
year = {2013},
address = {Sweden},
}
In certain industries, quality testing is crucial to make sure that the components being manufactured do not contain any defects. One method to detect these defects is to heat the specimen being inspected and then study the cooling process using infrared thermography. The exploration of non-destructive testing using thermography is at an early stage, and the purpose of this thesis is therefore to analyse some of the existing techniques and to propose improvements.
A test specimen containing several different defects was designed specifically for this thesis. A flash lamp was used to heat the specimen, and a high-speed infrared camera was used to study both the spatial and temporal features of the cooling process. An algorithm was implemented to detect anomalies, and different parameter settings were evaluated. The results show that the proposed method is successful at finding the sought-after defects, and also outperforms one of the older methods.
@mastersthesis{diva2:610166,
author = {Höglund, Kristofer},
title = {{Non-destructive Testing Using Thermographic Image Processing}},
school = {Linköping University},
type = {{LiTH-ISY-EX--13/4655--SE}},
year = {2013},
address = {Sweden},
}
This is a master thesis of the Master of Science degree program in Applied Physics and Electrical Engineering at Linköping University. The goal of this thesis is to find out how the Microsoft Kinect can be used as part of a camera rig to create accurate 3D models of an indoor environment. The Microsoft Kinect is marketed as a touch-free game controller for the Microsoft Xbox 360 game console. The Kinect contains a color camera and a depth camera. The depth camera works by constantly projecting a near-infrared dot pattern that is observed with a near-infrared camera. This thesis describes how to model the near-infrared projector pattern so that external near-infrared cameras can be used to improve the measurement precision. The depth data that the Kinect outputs has been studied to determine what types of errors it contains. One finding was that the Kinect uses an online calibration algorithm that changes the depth data.
@mastersthesis{diva2:566581,
author = {Nordmark, Anton},
title = {{Kinect 3D Mapping}},
school = {Linköping University},
type = {{LiTH-ISY-EX--12/4636--SE}},
year = {2012},
address = {Sweden},
}
The introduction of dual energy CT (DECT) in the field of medical healthcare has made it possible to extract more information about the scanned objects. This in turn has the potential to improve the accuracy of radiation therapy dose planning. One problem that remains before successful material decomposition can be achieved, however, is the presence of beam hardening and scatter artifacts that arise in a scan. Methods currently in clinical use for removal of beam hardening often bias the CT numbers. Hence, the possibility of an appropriate tissue decomposition is limited.
Here a method for successful decomposition as well as removal of the beam hardening artifact is presented. The method uses effective linear attenuations for the five base materials (water, protein, adipose, cortical bone and marrow) to perform the decomposition on reconstructed simulated data. This is performed inside an iterative loop, together with the polychromatic x-ray spectra, to remove the beam hardening.
@mastersthesis{diva2:549562,
author = {Grandell, Oscar},
title = {{An iterative reconstruction algorithm for quantitative tissue decomposition using DECT}},
school = {Linköping University},
type = {{LiTH-ISY-EX--12/4617--SE}},
year = {2012},
address = {Sweden},
}
In this master thesis a visual odometry system is implemented and explained. Visual odometry is a technique that can be used on autonomous vehicles to determine their current position, and it is preferably used indoors where GPS does not work. The only input to the system is the images from a stereo camera, and the output is the current location given as a relative position.
In the C++ implementation, image features are found and matched between the stereo images and the previous stereo pair, which gives in the range of 150-250 verified feature matches. The image coordinates are triangulated into a 3D point cloud. The distance between two subsequent point clouds is minimized with respect to rigid transformations, which gives the motion described with six parameters, three for the translation and three for the rotation.
Noise in the image coordinates gives reconstruction errors which make the motion estimation very sensitive. The results from six experiments show that the weakness of the system is its ability to distinguish rotations from translations. However, if the system has additional knowledge of how it is moving, the minimization can be done with only three parameters and the system can estimate its position with less than 5 % error.
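The sketch below illustrates the motion-estimation step: given matched 3D points from two successive stereo frames, the rigid transform minimizing point-to-point distances has a closed-form SVD (Kabsch) solution. This is a common, hedged stand-in for the six-parameter minimization described above, shown here on synthetic noise-free correspondences.

```python
# Hedged sketch of rigid-motion estimation between two matched point clouds
# using the closed-form Kabsch/SVD solution; the thesis' own six-parameter
# minimization may differ. Data below are synthetic.
import numpy as np

def rigid_transform(P, Q):
    """Find R, t such that Q ~= R @ P + t for matched 3 x N point sets."""
    cp, cq = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    H = (P - cp) @ (Q - cq).T
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 200))                       # previous-frame point cloud
angle = np.deg2rad(5)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
Q = R_true @ P + np.array([[0.1], [0.0], [0.3]])    # current-frame point cloud
R, t = rigid_transform(P, Q)
print(np.allclose(R, R_true), t.ravel())            # recovers the simulated motion
```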
@mastersthesis{diva2:550998,
author = {Johansson, Fredrik},
title = {{Visual Stereo Odometry for Indoor Positioning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--12/4621--SE}},
year = {2012},
address = {Sweden},
}
Functional Magnetic Resonance Imaging (fMRI) is one of the best techniques for neuroimaging and has revolutionized the way we understand brain function. It measures changes in the blood oxygen level-dependent (BOLD) signal, which is related to neuronal activity. The complexity of the data, the presence of different types of noise and the massive amount of data make fMRI data analysis challenging. It demands efficient signal processing and statistical analysis methods. The inferences from the analysis are used by physicians, neurologists and researchers for a better understanding of brain function.
The purpose of this study is to design a toolbox for fMRI data analysis. It includes methods to detect brain activity maps, estimate the hemodynamic response (HDR) and assess the connectivity of brain structures. The toolbox provides methods for detection of activated brain regions measured with a Bayesian estimator. Results are compared with conventional methods such as the t-test, ordinary least squares (OLS) and weighted least squares (WLS). Brain activation and the HDR are estimated with a linear adaptive model and a nonlinear method based on a radial basis function (RBF) neural network. A nonlinear autoregressive with exogenous inputs (NARX) neural network is developed to model the dynamics of the fMRI data. The toolbox also provides methods for brain connectivity, such as functional connectivity and effective connectivity. These methods are examined on simulated and real fMRI datasets.
@mastersthesis{diva2:551505,
author = {Budde, Kiran Kumar},
title = {{A Matlab Toolbox for fMRI Data Analysis: Detection, Estimation and Brain Connectivity}},
school = {Linköping University},
type = {{LiTH-ISY-EX--12/4600--SE}},
year = {2012},
address = {Sweden},
}
Autonomous vehicles have many possible applications in different fields, such as rescue missions, the exploration of unknown environments and unmanned vehicles. For such a system to navigate in a safe manner, high requirements on reliability and security must be fulfilled.
This master's thesis explores the possibility of using a convolutional network, a machine learning algorithm, on a robotic platform for autonomous path following. The only input used to predict the steering signal is a monochromatic image taken by a camera mounted on the robotic car, pointing in the steering direction. The convolutional network learns from demonstrations in a supervised manner.
In this thesis three different preprocessing options are evaluated. The evaluation is based on the quadratic error and the number of correctly predicted classes. The results show that the convolutional network has no problem learning a correct behaviour and achieves good results when evaluated on data similar to that which it has been trained on. The results also show that the preprocessing options are not enough to make the system independent of the environment.
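The sketch below is a modern PyTorch stand-in for the supervised behaviour-cloning setup described above: a small convolutional network maps a monochrome camera image to a discrete steering class and is trained from demonstrated steering commands. The architecture, input size and number of classes are placeholder assumptions, not the network used in the thesis.

```python
# Hedged PyTorch stand-in for the behaviour-cloning setup described above:
# a small CNN maps a monochrome image to a discrete steering class.
# Layer sizes and the number of classes are placeholders.
import torch
import torch.nn as nn

class SteeringNet(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = SteeringNet()
images = torch.rand(8, 1, 96, 96)              # a batch of monochrome frames
targets = torch.randint(0, 5, (8,))            # demonstrated steering classes
loss = nn.CrossEntropyLoss()(model(images), targets)
loss.backward()                                # supervised learning from demonstrations
print(float(loss))
```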
@mastersthesis{diva2:534610,
author = {Schmiterlöw, Maria},
title = {{Autonomous Path Following Using Convolutional Networks}},
school = {Linköping University},
type = {{LiTH-ISY-EX--12/4577--SE}},
year = {2012},
address = {Sweden},
}
This is a master thesis of the Master of Science degree program in Applied Physics and Electrical Engineering (Y) at Linköping University. The goal of the project is to develop an application for creating a map in real time from a video camera on a miniature unmanned aerial vehicle. This thesis project and report is a first exploratory study for this application. It implements a prototype method and evaluates it on sample sequences from an on-board video camera. The method first looks for good points to follow in the image and then tracks them in a sequence. The image is then pasted, or merged, together with previous images so that points from the different images align.
Two methods to find good points to follow are examined, with focus on real-time performance. The result is that the much faster FAST detector yielded results good enough to replace the slower standard Harris-Stephens corner detector.
It is also examined whether it is possible to assume that the ground is a flat surface in this application, or if a computationally more expensive method estimating altitude information has to be used. The result is that at high altitudes, or when the ground is close to flat in reality and the camera points straight downwards, a two-dimensional method will do. When flying lower or with tall objects in the picture, which is often the case in this application, it must be taken into account that the points really are at different heights; hence the ground cannot be assumed to be flat.
@mastersthesis{diva2:514063,
author = {Wolkesson, Henrik},
title = {{Realtime Mosaicing of Video Stream from $\mu$UAV}},
school = {Linköping University},
type = {{LiTH-ISY-EX--07/4140--SE}},
year = {2012},
address = {Sweden},
}
In today's industry 3D cameras are often used to inspect products. The camera produces both a 3D model and an intensity image by capturing a series of profiles of the object using laser triangulation. In many of these setups a physical encoder is attached to, for example, the conveyor belt that the product is travelling on. The encoder is used to get an accurate reading of the speed that the product has when it passes through the laser. Without this, the output image from the camera can be distorted due to a variation in velocity.
In this master thesis a method for integrating the functionality of this physical encoder into the software of the camera is proposed. The object is scanned together with a pattern; with the help of this pattern, the object can be restored to its original proportions.
@mastersthesis{diva2:455669,
author = {Josefsson, Mattias},
title = {{3D camera with built-in velocity measurement}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4523--SE}},
year = {2011},
address = {Sweden},
}
This report has investigated the possibility of automatically identifying ditches from airborne LiDAR data. The chosen identification method first creates a height image from the LiDAR data. It then extracts ditch candidates by vectorizing the result of a line detection. The properties of the ditch candidates are then computed through an analysis of height profiles for each individual candidate, where the height profiles are created from the original data. By filtering the candidates on their properties, ditch maps with user-specified ditch dimensions can be presented in a vector format that facilitates further use. The report describes how the algorithm has been implemented and also presents example results. After an analysis of the algorithm and suggestions for improvements, the main conclusion of the report is presented: automatic detection of ditches is possible.
@mastersthesis{diva2:456702,
author = {Wasell, Richard},
title = {{Automatisk detektering av diken i LiDAR-data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4524--SE}},
year = {2011},
address = {Sweden},
}
When patients move during an MRI examination, severe artifacts arise in the reconstructed image and motion correction is therefore often desired. An in-plane motion correction algorithm suitable for PRESTO-CAN, a new 3D functional MRI method where sampling of k-space is radial in kx-direction and kz-direction and Cartesian in ky-direction, was implemented in this thesis work.
Rotation and translation movements can be estimated and corrected for separately, since the magnitude of the data is only affected by the rotation. The data were sampled in a radial pattern, and the rotation was estimated by finding the translation in the angular direction using circular correlation. Correlation was also used when finding the translations in the x- and z-directions.
The motion correction algorithm was evaluated on computer simulated data, the motion was detected and corrected for, and this resulted in images with greatly reduced artifacts due to patient movements.
@mastersthesis{diva2:456354,
author = {Karlsson, Anette},
title = {{In-Plane Motion Correction in Reconstruction of non-Cartesian 3D-functional MRI}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4480--SE}},
year = {2011},
address = {Sweden},
}
MRI (Magnetic Resonance Imaging) is a medical imaging method that uses magnetic fields in order to retrieve images of the human body. This thesis revolves around a novel acquisition method for 3D fMRI (functional Magnetic Resonance Imaging) called PRESTO-CAN, which uses a radial pattern to sample the (kx,kz)-plane of k-space (the frequency domain) and a Cartesian sample pattern in the ky-direction. The radial sample pattern allows for a denser sampling of the central parts of k-space, which contain the most basic frequency information about the structure of the recorded object. This allows higher temporal resolution to be achieved compared with other sampling methods, since fewer total samples are needed in order to retrieve enough information about how the object has changed over time. Since fMRI is mainly used for monitoring blood flow in the brain, increased temporal resolution means that fast changes in brain activity can be tracked more efficiently.
The temporal resolution can be further improved by reducing the time needed for scanning, which in turn can be achieved by applying parallel imaging. One such parallel imaging method is SENSE (SENSitivity Encoding). The scan time is reduced by decreasing the sampling density, which causes aliasing in the recorded images. The aliasing is removed by the SENSE method by utilizing the extra information provided by the fact that multiple receiver coils with differing sensitivities are used during the acquisition. By measuring the sensitivities of the respective receiver coils and solving an equation system with the aliased images, it is possible to calculate how they would have looked without aliasing.
In this master thesis, SENSE has been successfully implemented in PRESTO-CAN. By using normalized convolution to refine the sensitivity maps of the receiver coils, images of satisfying quality could be reconstructed when reducing the k-space sample rate by a factor of 2, and images of relatively good quality also when the sample rate was reduced by a factor of 4. In this way, this thesis has contributed to the improvement of the temporal resolution of the PRESTO-CAN method.
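The sketch below illustrates the SENSE unfolding step for a reduction factor of 2: pairs of pixels that alias onto each other are separated by solving, per pixel, a small least-squares system built from the coil sensitivities. The sensitivity maps and image are synthetic placeholders, and k-space simulation, normalized convolution and the PRESTO-CAN sampling pattern are all omitted.

```python
# Hedged sketch of SENSE unfolding for reduction factor R = 2: two pixels that
# alias onto the same location are separated per pixel by a least-squares
# system built from the coil sensitivities. All data are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
ny, nx, ncoils, R = 64, 64, 4, 2
truth = rng.random((ny, nx))                          # fully sampled image
sens = rng.random((ncoils, ny, nx)) + 0.1             # coil sensitivities (smooth in reality)

# Simulate aliased coil images: rows y and y + ny/R fold on top of each other.
aliased = np.zeros((ncoils, ny // R, nx))
for c in range(ncoils):
    coil_img = sens[c] * truth
    aliased[c] = coil_img[: ny // R] + coil_img[ny // R :]

# Unfold: for every aliased pixel, solve an (ncoils x R) least-squares system.
recon = np.zeros((ny, nx))
for y in range(ny // R):
    for x in range(nx):
        A = np.stack([sens[:, y, x], sens[:, y + ny // R, x]], axis=1)
        b = aliased[:, y, x]
        sol, *_ = np.linalg.lstsq(A, b, rcond=None)
        recon[y, x], recon[y + ny // R, x] = sol
print("max reconstruction error:", np.abs(recon - truth).max())
```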
@mastersthesis{diva2:423964,
author = {Ahlman, Gustav},
title = {{Improved Temporal Resolution Using Parallel Imaging in Radial-Cartesian 3D functional MRI}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4470--SE}},
year = {2011},
address = {Sweden},
}
Today, 3D models of cities are created from aerial images using a camera rig. Images, together with sensor data from the flights, are stored for further processing when building 3D models. However, there is a market demand for a more mobile solution of satisfactory quality. If the camera position can be calculated for each image, there is an existing algorithm available for the creation of 3D models.
This master thesis project aims to investigate whether the iPhone 4 offers image and sensor data of good enough quality for 3D models to be created from them. Calculations of movements and rotations from sensor data form the foundation of the image processing and should refine the camera position estimations.
The 3D models are built from image processing only, since the sensor data cannot be used due to poor accuracy. Because of that, the scaling of the 3D models is unknown, and a measurement of the real objects is needed to make scaling possible. Compared to a test algorithm that calculates 3D models from images only, already available in the SBD's system, the quality of the 3D model in this master thesis project is, judged by the human eye, almost the same or in some respects even better.
@mastersthesis{diva2:452945,
author = {Lundqvist, Tobias},
title = {{3D mapping with iPhone}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4517--SE}},
year = {2011},
address = {Sweden},
}
In this thesis, an investigation was performed to find ways of differentiating between fires and vehicles at waste stations, in the hope of removing vehicles as a source of error during early fire detection. The existing system makes use of a heat camera, which rotates through 48 different angles (also known as zones) in a fixed position. If the heat is above a certain value within a zone, the system sounds the fire alarm. The rotation of the camera results in an unwanted displacement between two successive frames within the same zone. By use of image registration, this displacement was removed. After the registration of an image, segmentation was performed, in which cold objects are eliminated as an error source. Lastly, an analysis was performed on the warm objects. In the end, it was shown that the image registration was a successful improvement of the existing system. It was also shown that vehicles can, to some extent, be eliminated as an error source.
@mastersthesis{diva2:446792,
author = {Söderström, Rikard},
title = {{An early fire detection system through registration and analysis of waste station IR-images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4354--SE}},
year = {2011},
address = {Sweden},
}
Medical imaging is an important tool for diagnosis and treatment planning today. However, as the demand for efficiency increases at the same time as data volumes grow immensely, the need for computer-assisted analysis, such as image segmentation, to help and guide the practitioner increases.
Medical image segmentation can be used for various tasks; the localization and delineation of pathologies such as cancer tumors is just one example. Numerous problems with noise and image artifacts in the generated images make segmentation a difficult task, and the developer is forced to choose between speed and performance. In clinical practice, however, this is impossible as both speed and performance are crucial. One solution to this problem might be to involve the user more in the segmentation, using interactive algorithms where the user can influence the segmentation for an improved result.
This thesis has concentrated on finding a fast and interactive segmentation method for liver tumor segmentation. Various different methods were explored, and a few were chosen for implementation and further development. Two methods appeared to be the most promising, Bayesian Region Growing (BRG) and Level Set.
An interactive Level Set algorithm emerged as the best alternative for the interactivity of the algorithm, and could be used in combination with both BRG and Level Set. A new data term based on a probability model instead of image edges was also explored for the Level Set method, and proved to be more promising than the original one. The probability-based Level Set and the BRG method both provided good-quality results, but the faster of the two was the BRG method, which could segment a tumor present in 25 CT image slices in less than 10 seconds when implemented in Matlab and mex-C++ code on an ACPI x64-based PC with two 2.4 GHz Intel(R) Core(TM)2 CPUs and 8 GB of RAM. The interactive Level Set could be successfully used as an interactive addition to the automatic method, but its usefulness was somewhat reduced by its slow processing time (about 1.5 s/slice) and the relative complexity of the needed user interactions.
@mastersthesis{diva2:438557,
author = {Thomasson, Viola},
title = {{Liver Tumor Segmentation Using Level Sets and Region Growing}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4485--SE}},
year = {2011},
address = {Sweden},
}
Most mobile video-recording devices of today, e.g. cell phones and music players, make use of a rolling shutter camera. A rolling shutter camera captures video by recording every frame line-by-line from top to bottom of the image, leading to image distortions in situations where either the device or the target is moving. Recording video by hand also leads to visible frame-to-frame jitter.
In this thesis, methods to decrease distortion caused by the motion of a video-recording device with a rolling shutter camera are presented. The methods are based on estimating the orientation of the camera from gyroscope and accelerometer measurements.
The algorithms are implemented on the iPod Touch 4, and the resulting videos are compared to those of competing stabilization software, both commercial and free, in a series of blind experiments. The results from this user study show that the methods presented in the thesis perform as well as or better than the others.
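The sketch below shows the core of such an approach: integrating gyroscope angular rates into a unit quaternion that represents the camera orientation over time, which can then drive rolling-shutter correction and frame-to-frame stabilization. It is a hedged, minimal example; accelerometer fusion, bias estimation and camera/gyro time synchronization, which the thesis also has to handle, are omitted, and the sample data are placeholders.

```python
# Hedged sketch of the core orientation estimate: gyroscope angular rates are
# integrated into a unit quaternion. Accelerometer fusion, bias handling and
# camera/gyro synchronization are omitted; the gyro samples are placeholders.
import numpy as np

def quat_mult(q, r):
    w0, x0, y0, z0 = q
    w1, x1, y1, z1 = r
    return np.array([
        w0*w1 - x0*x1 - y0*y1 - z0*z1,
        w0*x1 + x0*w1 + y0*z1 - z0*y1,
        w0*y1 - x0*z1 + y0*w1 + z0*x1,
        w0*z1 + x0*y1 - y0*x1 + z0*w1])

def integrate_gyro(q, omega, dt):
    """Advance orientation q by angular rate omega (rad/s, body frame) over dt."""
    dq = np.concatenate(([0.0], omega))
    q = q + 0.5 * dt * quat_mult(q, dq)
    return q / np.linalg.norm(q)

q = np.array([1.0, 0.0, 0.0, 0.0])                 # identity orientation
gyro_samples = [np.array([0.0, 0.0, 0.5])] * 200   # 0.5 rad/s yaw, placeholder data
for omega in gyro_samples:
    q = integrate_gyro(q, omega, dt=0.005)
print(q)   # ~1 s at 0.5 rad/s yaw: roughly [cos(0.25), 0, 0, sin(0.25)]
```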
@mastersthesis{diva2:420914,
author = {Hanning, Gustav},
title = {{Video Stabilization and Rolling Shutter Correction using Inertial Measurement Sensors}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4464--SE}},
year = {2011},
address = {Sweden},
}
Monitoring wear particles in lubricating oils allows specialists to evaluate the health and functionality of a mechanical system. The main analysis techniques available today are manual particle analysis and automatic optical analysis. Manual particle analysis is effective and reliable since the analyst continuously sees what is being counted. The drawback is that the technique is quite time-demanding and dependent on the skills of the analyst. Automatic optical particle counting constitutes a closed system that does not allow the counted objects to be observed in real time. This has resulted in a number of sources of error for the instrument. In this thesis a new method for counting particles, based on light microscopy with image analysis, is proposed. It has proven to be a fast and effective method that eliminates the sources of error of the previously described methods. The new method correlates very well with manual analysis, which is used as a reference method throughout this study. Size estimation of particles and detection of metallic particles have also been shown to be possible with the current image analysis setup. With more advanced software and analysis instrumentation, the image analysis method could be further developed into a decision-based machine allowing for declarations about which wear mode is occurring in a mechanical system.
@mastersthesis{diva2:420518,
author = {Ceco, Ema},
title = {{Image Analysis in the Field of Oil Contamination Monitoring}},
school = {Linköping University},
type = {{LITH-ISY-EX--11/4467--SE}},
year = {2011},
address = {Sweden},
}
In this master thesis, a model-based video coding algorithm has been developed that uses input from a colour and depth camera, such as the Microsoft Kinect. Using a model-based representation of a video has several advantages over the commonly used block-based approach, used by the H.264 standard. For example, videos can be rendered in 3D, be viewed from alternative views, and have objects inserted into them for augmented reality and user interaction.
This master thesis demonstrates a very efficient way of encoding the geometry of a scene. The results of the proposed algorithm show that it can reach very low bitrates with comparable results to the H.264 standard.
@mastersthesis{diva2:420400,
author = {Sandberg, David},
title = {{Model-Based Video Coding Using a Colour and Depth Camera}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4463--SE}},
year = {2011},
address = {Sweden},
}
In this master thesis the possibility of detecting and tracking objects in multispectral infrared video sequences is investigated. The current method, with fixed-size rectangles, has significant disadvantages. These disadvantages are addressed using image segmentation to estimate the shape of the object. The result of the image segmentation is used to determine the infrared contrast of the object. Our results show that some objects give very good segmentation, tracking and shape detection. The objects that perform best are the flares and countermeasures, but helicopters seen from the side, with significant movement, are in particular better detected with our method. The motion of the object is very important, since movement is the main component in successful shape detection; this is because helicopters are much colder than flares and engines. Detecting the presence and position of moving objects is easier and can be done quite successfully even for helicopters. Using structure tensors we can also detect the presence and estimate the position of stationary objects.
@mastersthesis{diva2:415941,
author = {Möller, Sebastian},
title = {{Image Segmentation and Target Tracking using Computer Vision}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4424--SE}},
year = {2011},
address = {Sweden},
}
3D cameras delivering height data can be used for quality inspection of goods on a conveyor.
It is then of interest to distinguish the important parts of the image from background and noise and further to divide these interesting parts into segments that have a strong correlation to objects on the conveyor belt.
Segmentation can easily be done by thresholding in the simple case. However, in more complex situations, for example when objects touch or overlap, this does not work well.
In this thesis, research and evaluation of a few different methods for segmentation of height image data are presented. The focus is to find an accurate method for segmentation of smooth irregularly shaped organic objects such as vegetables or shellfish.
For evaluative purposes a database consisting of height images depicting a variety of such organic objects has been collected.
We show in the thesis that a conventional gradient magnitude method is hard to beat in the general case. If, however, the objects to be segmented are heavily non-convex with a lot of crests and valleys within themselves one could be better off choosing a normalized least squares method.
@mastersthesis{diva2:393236,
author = {Schöndell, Andreas},
title = {{Evaluation of methods for segmentation of 3D range image data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--11/4346--SE}},
year = {2011},
address = {Sweden},
}
This thesis describes the development of a robotic platform for evaluation of gaze stabilization algorithms, built for the Sensorimotor Systems Laboratory at the University of British Columbia. The primary focus of the work was to measure the performance of a biomimetic vestibulo-ocular reflex controller for gaze stabilization using cerebellar feedback. A flexible robotic system was designed and built in order to run reproducible test sequences at high speeds, featuring three-dimensional linear movement and rotation around the vertical axis. On top of the robot head, a 1 DOF camera head can be independently controlled by a stabilization algorithm implemented in Simulink. Vestibular input is provided by a 3-axis accelerometer and a 3-axis gyroscope. The video feed from the camera head is fed into a workstation computer running a custom image processing program which evaluates both the absolute and the relative movement of the images in the sequence. The absolute angles of tracked regions in the image are continuously returned, as well as the movement of the image sequence across the sensor in full 3 DOF camera rotation. Due to dynamic downsampling and noise suppression algorithms, very good performance was reached, enabling retinal slip estimation at 720 degrees per second. Two different controllers were implemented: one adaptive open-loop controller similar to Dean et al.'s work [12], and one reference implementation using closed-loop control and optimal linear estimation of reference angles. A sequence of tests was run in order to evaluate the performance of the two algorithms. The adaptive controller was shown to offer superior performance, dramatically reducing the movement of the image for all test sequences, and it also improved further as it was tuned over time.
@mastersthesis{diva2:359452,
author = {Landgren, Axel},
title = {{A robotic camera platform for evaluation of biomimetic gaze stabilization using adaptive cerebellar feedback}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4351--SE}},
year = {2010},
address = {Sweden},
}
This master thesis investigates the difficulties of constructing a depth map using one low resolution grayscale camera mounted in the front of a car. The goal is to produce a depth map in real-time to assist other algorithms in the safety system of a car. This has been shown to be difficult using the evaluated combination of camera position and choice of algorithms.
The main problem is to estimate an accurate optical flow. Another problem is to handle moving objects. The conclusion is that the implementations, mainly triangulation of corresponding points tracked using a Lucas Kanade tracker, provide information of too poor quality to be useful for the safety system of a car.
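For illustration only, point correspondences of the kind used above can be obtained with a pyramidal Lucas-Kanade tracker; the OpenCV sketch below is a generic stand-in, not the thesis implementation, and all parameter values are assumptions.

import cv2

def track_points(prev_gray, next_gray):
    # Corners that are easy to track in the first frame.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    # Pyramidal Lucas-Kanade flow from the first frame to the second.
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                                winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)

Given the camera motion between the frames, such correspondences could then be triangulated to obtain depth estimates.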
@mastersthesis{diva2:355971,
author = {Svensson, Fredrik},
title = {{Structure from Forward Motion}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4364--SE}},
year = {2010},
address = {Sweden},
}
This thesis treats topics within the area of object recognition. A real-time view matching method has been developed to compute the transformation between two different images of the same scene. This method uses a color based region detector called MSCR and affine transformations of these regions to create affine-invariant patches that are used as input to the SIFT algorithm. A parallel method to compute the SIFT descriptor has been created with relaxed constraints, so that the descriptor size and the number of histogram bins can be adjusted. Additionally, a matching step to deduce correspondences and a parallel RANSAC method have been created to estimate the transformation between the images from these descriptors. To achieve real-time performance, the implementation has been targeted to use the parallel nature of the GPU with CUDA as the programming language. Focus has been put on the architecture of the GPU to find the best way to parallelize the different processing steps. CUDA has also been combined with OpenGL to be able to use the hardware accelerated anisotropic sampling method for affine transformations of regions. Parts of the implementation can also be used individually, either from Matlab or by using the provided C++ library directly. The method was also evaluated in terms of accuracy and speed. It was shown that our algorithm has similar or better accuracy at finding correspondences than SIFT when the 3D geometry changes are large, but slightly worse results on images with flat surfaces.
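As a rough CPU-side reference for the matching and RANSAC steps (the thesis's CUDA/MSCR pipeline is not reproduced here), a standard OpenCV sketch might look as follows; the ratio-test threshold and reprojection tolerance are assumptions.

import cv2
import numpy as np

def match_and_estimate(img1, img2):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    # Ratio test on the two nearest neighbours keeps distinctive matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(d1, d2, k=2)
            if m.distance < 0.75 * n.distance]
    src = np.float32([k1[m.queryIdx].pt for m in good])
    dst = np.float32([k2[m.trainIdx].pt for m in good])
    # RANSAC rejects outliers while estimating the image-to-image mapping.
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H, inliers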
@mastersthesis{diva2:345932,
author = {Lind, Anders},
title = {{High-speed View Matching using Region Descriptors}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4356--SE}},
year = {2010},
address = {Sweden},
}
The thesis presents an investigation of the potential of measuring plant condition from hyperspectral reflectance data. To do this, some linear methods for embedding the high dimensional hyperspectral data and performing regression to a plant condition space have been compared. A preprocessing step that aims at normalized illumination intensity in the hyperspectral images has been conducted, and some different methods for this purpose have also been compared. A large scale experiment has been conducted where tobacco plants have been grown and treated differently with respect to watering and nutrition. The treatment of the plants has served as ground truth for the plant condition. Four sets of plants have been grown one week apart and the plants have been measured at different ages up to the age of about five weeks. The thesis concludes that there is a relationship between the treatment of the plants and the spectral reflectance of their leaves, but the treatment has to be somewhat extreme to enable a useful treatment approximation from the spectrum. CCA is the proposed method for calculating the hyperspectral basis that is used to embed the hyperspectral data into the plant condition (treatment) space. A preprocessing method that uses a weighted normalization of the spectra for illumination intensity normalization is concluded to be the most powerful of the compared methods.
@mastersthesis{diva2:350907,
author = {Johansson, Peter},
title = {{Plant Condition Measurement from Spectral Reflectance Data}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4369--SE}},
year = {2010},
address = {Sweden},
}
Man portable air defence systems, MANPADS, pose a big threat to civilian and military aircraft. This thesis aims to find methods that could be used in a missile approach warning system based on infrared cameras.
The two main tasks of the completed system are to classify the type of missile, and also to estimate its position and velocity from a sequence of images.
The classification is based on hidden Markov models, one-class classifiers, and multi-class classifiers.
Position and velocity estimation uses a model of the observed intensity as a function of real intensity, image coordinates, distance and missile orientation. The estimation is made by an extended Kalman filter.
We show that fast classification of missiles based on radiometric data and a hidden Markov model is possible and works well, although more data would be needed to verify the results.
Estimating the position and velocity works fairly well if the initial parameters are known. Unfortunately, some of these parameters cannot be computed using the available sensor data.
@mastersthesis{diva2:323455,
author = {Holm Ovr\'{e}n, Hannes and Emilsson, Erika},
title = {{Missile approach warning using multi-spectral imagery}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4329--SE}},
year = {2010},
address = {Sweden},
}
Most people are familiar with the BRIO labyrinth game and the challenge of guiding the ball through the maze. The goal of this project was to use this game to create a platform for evaluation of control algorithms. The platform was used to evaluate a few different control algorithms, both traditional automatic control algorithms and algorithms based on online incremental learning.
The game was fitted with servo actuators for tilting the maze. A camera together with computer vision algorithms was used to estimate the state of the game. The evaluated control algorithm had the task of calculating a proper control signal, given the estimated state of the game.
The evaluated learning systems used traditional control algorithms to provide initial training data. After initial training, the systems learned from their own actions and after a while they outperformed the controller used to provide initial training.
@mastersthesis{diva2:322572,
author = {Öfjäll, Kristoffer},
title = {{LEAP, A Platform for Evaluation of Control Algorithms}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4370--SE}},
year = {2010},
address = {Sweden},
}
In this thesis an algorithm for producing saliency maps as well as an algorithm for detecting salient regions based on the saliency map was developed. The saliency values are computed as center-surround differences and a local descriptor called the region p-channel is used to represent center and surround respectively. An integral image representation called the integral p-channel is used to speed up extraction of the local descriptor for any given image region. The center-surround difference is calculated as either histogram or p-channel dissimilarities.
Ground truth was collected using human subjects and the algorithm’s ability to detect salient regions was evaluated against this ground truth. The algorithm was also compared to another saliency algorithm.
Two different center-surround interpretations are tested, as well as several p-channel and histogram dissimilarity measures. The results show that for all tested settings the best performing dissimilarity measure is the so-called diffusion distance. The performance comparison showed that the algorithm developed in this thesis outperforms the algorithm against which it was compared, both with respect to region detection and saliency ranking of regions. It can be concluded that the algorithm shows promising results and further investigation of the algorithm is recommended. A list of suggested approaches for further research is provided.
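For illustration, a center-surround saliency value computed from plain gray-level histograms with a chi-square dissimilarity could look like the sketch below; the p-channel machinery, the integral-image speed-up and the diffusion distance favoured in the thesis are deliberately not reproduced, and the radii are assumptions.

import numpy as np

def histogram(patch, bins=16):
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def center_surround_saliency(gray, y, x, r_in=8, r_out=24):
    center = gray[y - r_in:y + r_in, x - r_in:x + r_in]
    surround = gray[y - r_out:y + r_out, x - r_out:x + r_out]
    hc, hs = histogram(center), histogram(surround)
    # Chi-square dissimilarity between center and surround distributions.
    return 0.5 * np.sum((hc - hs) ** 2 / (hc + hs + 1e-9))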
@mastersthesis{diva2:291472,
author = {Tuttle, Alexander},
title = {{Saliency Maps using Channel Representations}},
school = {Linköping University},
type = {{LITH-ISY-EX--10/4169--SE}},
year = {2010},
address = {Sweden},
}
Foreground segmentation is a common first step in tracking and surveillance applications. The purpose of foreground segmentation is to provide later stages of image processing with an indication of where interesting data can be found. This thesis is an investigation of how foreground segmentation can be performed in two contexts: as a pre-step to trajectory tracking and as a pre-step in indoor surveillance applications.
Three methods are selected and detailed: a single Gaussian method, a Gaussian mixture model method, and a codebook method. Experiments are then performed on typical input video using the methods. It is concluded that the Gaussian mixture model produces the output which yields the best trajectories when used as input to the trajectory tracker. An extension is proposed to the Gaussian mixture model which reduces shadows, improving the performance of foreground segmentation in the surveillance context.
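As an aside, a Gaussian mixture background subtractor with built-in shadow labelling is readily available in OpenCV; the sketch below is only a generic illustration of the technique, not the extension proposed in the thesis.

import cv2

def foreground_masks(frames):
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                    varThreshold=16,
                                                    detectShadows=True)
    for frame in frames:
        mask = subtractor.apply(frame)
        # MOG2 labels shadow pixels 127; keep only confident foreground (255).
        yield (mask == 255).astype('uint8') * 255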
@mastersthesis{diva2:285807,
author = {Molin, Joel},
title = {{Foreground Segmentation of Moving Objects}},
school = {Linköping University},
type = {{LiTH-ISY-EX--10/4299--SE}},
year = {2010},
address = {Sweden},
}
Within this thesis an algorithm for object recognition called Cluster Matching has been developed, implemented and evaluated. The image information is sampled at arbitrary sample points, instead of interest points, and local image features are extracted. These sample points are used as a compact representation of the image data and can quickly be searched for prior known objects. The algorithm is evaluated on a test set of images and the result is surprisingly reliable and time efficient.
@mastersthesis{diva2:284633,
author = {Lennartsson, Mattias},
title = {{Object Recognition with Cluster Matching}},
school = {Linköping University},
type = {{LITH-ISY-EX--09/4152--SE}},
year = {2009},
address = {Sweden},
}
Time of flight (ToF) is an imaging technique that uses depth information to capture 3D information in a scene. Recent developments in the technology have made ToF cameras more widely available and practical to work with. The cameras now enable real-time 3D imaging and positioning in a compact unit, making the technology suitable for a variety of object recognition tasks.
An object recognition system for locating teats is at the center of the DeLaval VMS, which is a fully automated system for milking cows. By implementing ToF technology as part of the visual detection procedure, it would be possible to locate and track the positions of all four teats in real time and potentially provide an improvement compared with the current system.
The developed algorithm for teat detection is able to locate teat-shaped objects in scenes and extract information about their position, width and orientation. These parameters are determined with an accuracy of millimeters. The algorithm also shows promising results when tested on real cows. Although it detects many false positives, the algorithm correctly detected 171 out of 232 visible teats in a test set of real cow images. This result is a satisfying proof of concept and shows the potential of ToF technology in the field of automated milking.
@mastersthesis{diva2:224321,
author = {Westberg, Michael},
title = {{Time of Flight Based Teat Detection}},
school = {Linköping University},
type = {{LiTH-ISY-EX--09/4154 --SE}},
year = {2009},
address = {Sweden},
}
This thesis is about improving the image quality of image sequences scanned by the film scanner GoldenEye. Film grain is often seen as an artistic effect in film sequences, but scanned images can be grainier or noisier than intended. To remove grain and noise as well as sharpen the images, a few known image enhancement methods have been implemented, tested and evaluated. A thresholding method of our own, based on the dyadic wavelet transform, has also been tested. MATLAB has been used as the benchmark environment, but one method has also been implemented in C/C++. Some of the methods perform satisfactorily in terms of image quality, but none of them is satisfactory in terms of time consumption. To address this, a few speed-up ideas are suggested at the end of the thesis. A method to correct the color of the sequences has also been suggested.
@mastersthesis{diva2:210478,
author = {Stuhr, Lina},
title = {{Grain Reduction in Scanned Image Sequences under Time Constraints}},
school = {Linköping University},
type = {{LiTH-ISY-EX--09/4203--SE}},
year = {2009},
address = {Sweden},
}
Gaze tracking is the estimation of the point in space a person is “looking at”. This is widely used in both diagnostic and interactive applications, such as visual attention studies and human-computer interaction. The most common commercial solution used to track gaze today uses a combination of infrared illumination and one or more cameras. These commercial solutions are reliable and accurate, but often expensive. The aim of this thesis is to construct a simple single-camera gaze tracker from off-the-shelf components. The method used for gaze tracking is based on infrared illumination and a schematic model of the human eye. The user’s gaze point is estimated from images of the reflections of specific light sources in the surfaces of the eye. Evaluation is performed on the software and hardware components separately, and on the system as a whole. Accuracy is measured in spatial and angular deviation, and the result is an average accuracy of approximately one degree on synthetic data and 0.24 to 1.5 degrees on real images at a range of 600 mm.
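As a small illustration of how such angular accuracy figures can be computed, the helper below measures the angle between an estimated and a true gaze direction; the example vectors are assumptions, chosen so that a 10 mm offset at 600 mm corresponds to roughly one degree.

import numpy as np

def angular_error_deg(gaze_est, gaze_true):
    a = np.asarray(gaze_est, float) / np.linalg.norm(gaze_est)
    b = np.asarray(gaze_true, float) / np.linalg.norm(gaze_true)
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

print(angular_error_deg([0.0, 0.0, 600.0], [10.0, 0.0, 600.0]))  # about 0.95 degrees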
@mastersthesis{diva2:209626,
author = {Wallenberg, Marcus},
title = {{A Single-Camera Gaze Tracker using Controlled Infrared Illumination}},
school = {Linköping University},
type = {{LITH-ISY-EX--09/4199--SE}},
year = {2009},
address = {Sweden},
}
The PRESTO sequence is a well-known 3-D fMRI imaging sequence. In this sequence the echo planar imaging technique is merged with the echo-shift technique. This combination results in a very fast image acquisition, which is required for fMRI examinations of neural activation in the human brain. The aim of this work was to use the basic Cartesian PRESTO sequence as a framework when developing a novel trajectory using a non-Cartesian grid.
Our new pulse sequence, PRESTO CAN, rotates the k-space profiles around the ky-axis in a non-Cartesian manner. This results in a high sampling density close to the centre of the k-space, and at the same time it provides sparser data collection in the parts of the k-space that contain less useful information. This "can- or cylinder-like" pattern is expected to result in a much faster k-space acquisition without losing important spatial information.
A new reconstruction algorithm was also developed. The purpose was to be able to construct an image volume from data obtained using the novel PRESTO CAN sequence. This reconstruction algorithm was based on the gridding technique, and a Kaiser-Bessel window was used in order to re-sample the data onto a Cartesian grid. This was required to make 3-D Fourier transformation possible. In addition, simulations were performed in order to verify the function of the reconstruction algorithm. Furthermore, in vitro tests showed that the development of the PRESTO CAN sequence and the corresponding reconstruction algorithm was highly successful.
In the future, the results can relatively easily be extended and generalized for in vivo investigations. In addition, there are numerous exciting possibilities for extending the basic techniques described in this thesis.
@mastersthesis{diva2:397232,
author = {Thyr, Per},
title = {{Method for Acquisition and Reconstruction of non-Cartesian 3-D fMRI}},
school = {Linköping University},
type = {{LITH-ISY-EX--08/4058--SE}},
year = {2008},
address = {Sweden},
}
Navigation is normally based on an Internal Navigational System and a Global Navigational Satellite System (GNSS). In navigational warfare the GNSS can be jammed, so a third navigational system is needed. The system evaluated in this thesis is camera based navigation, in which the position is determined from a video camera and a sensor reference. This thesis deals with the matching between the sensor reference and the video image.
Two methods have been implemented: normalized cross correlation and position determination through a homography. Normalized cross correlation produces a correlation matrix between the sensor reference and the video image. The other method uses point correspondences between the images to estimate a homography, from which a position is obtained; the more point correspondences that are available, the better the position determination becomes.
The results have been quite good: the methods find the correct position when the Euler angles of the UAV are known, and normalized cross correlation is the best of the tested methods.
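For illustration only, the normalized cross correlation step can be prototyped with OpenCV's template matching; the function below is a generic stand-in for matching a camera frame against a larger sensor reference, and the naming is an assumption.

import cv2

def locate_in_reference(reference_gray, frame_gray):
    # Normalized cross correlation surface: one score per candidate position.
    scores = cv2.matchTemplate(reference_gray, frame_gray, cv2.TM_CCORR_NORMED)
    _, best, _, (x, y) = cv2.minMaxLoc(scores)
    return (x, y), best  # top-left corner of the best match and its score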
@mastersthesis{diva2:128466,
author = {Olgemar, Markus},
title = {{Camera Based Navigation:
Matching between Sensor reference and Video image}},
school = {Linköping University},
type = {{LITH-ISY-EX--08/4170--SE}},
year = {2008},
address = {Sweden},
}
The graphics processor has progressed rapidly in recent years, largely because of the demands from computer games for speed and image quality. Because of its special architecture, the graphics processor is much faster at solving parallel problems than a conventional processor, and its increasing programmability makes it possible to use it for tasks other than those it was originally designed for.
Even though graphics processors have been programmable for some time, it has been quite difficult to learn how to use them. CUDA enables the programmer to use C-code, with a few extensions, to program NVIDIA’s graphics processor and completely skip the traditional programming models. This thesis investigates if the graphics processor can be used for calculations without knowledge of how the hardware mechanisms work. An image processing algorithm calculating the optical flow has been implemented. The result shows that it is rather easy to implement programs using CUDA, but some knowledge of how the graphics processor works is required to achieve high performance.
@mastersthesis{diva2:127132,
author = {Ringaby, Erik},
title = {{Optical Flow Computation on Compute Unified Device Architecture}},
school = {Linköping University},
type = {{LiTH-ISY-EX--08/4043--SE}},
year = {2008},
address = {Sweden},
}
This master thesis has been conducted at the National Laboratory of Forensic Science (SKL) in Linköping. When images to be analyzed at SKL, showing an object of interest, are of poor quality, there may be a need to enhance them. If several images of the object are available, the combined information can be used to estimate a single enhanced image. A program to do this has been developed by studying methods for image registration and high-resolution image estimation, and tests of important parts of the procedure have been conducted. The final results are satisfying, and the key to a good high-resolution image seems to be the precision of the image registration; improvements of this part may lead to even better results. Further suggestions for improvement have also been proposed.
@mastersthesis{diva2:390,
author = {Karelid, Mikael},
title = {{Image Enhancement over a Sequence of Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--08/4013--SE}},
year = {2008},
address = {Sweden},
}
The purpose of this master thesis was to study computer vision algorithms for vehicle detection in monochrome images captured by a mono camera. The work has mainly been focused on detecting rear-view cars in daylight conditions. Previous work in the literature has been reviewed, and algorithms based on edges, shadows and motion as vehicle cues have been modified, implemented and evaluated. This work presents a combination of a multiscale edge based detection and a shadow based detection as the most promising algorithm, with a positive detection rate of 96.4% on vehicles at distances between 5 m and 30 m. For the algorithm to work in a complete system for vehicle detection, future work should be focused on developing a vehicle classifier to reject false detections.
@mastersthesis{diva2:18234,
author = {Lundagårds, Marcus},
title = {{Vehicle Detection in Monochrome Images}},
school = {Linköping University},
type = {{LiTH-ISY-EX--08/4148--SE}},
year = {2008},
address = {Sweden},
}
In this thesis it is examined whether the pose of an object can be determined by a system trained with a synthetic 3D model of said object. A number of variations of methods using P-channel representation are examined. Reference images are rendered from the 3D model, and features such as gradient orientation and color information are extracted and encoded into P-channels. The P-channel representation is then used to estimate an overlapping channel representation, using B1-spline functions, to estimate a density function for the feature set. Experiments were conducted with this representation as well as the raw P-channel representation in conjunction with a number of distance measures and estimation methods.
It is shown that, with correct preprocessing and choice of parameters, the pose can be detected with some accuracy and, if not in real-time, fast enough to be useful in a tracker initialization scenario. It is also concluded that the success rate of the estimation depends heavily on the nature of the object.
@mastersthesis{diva2:17521,
author = {Berg, Martin},
title = {{Pose Recognition for Tracker Initialization Using 3D Models}},
school = {Linköping University},
type = {{LiTH-ISY-EX--07/4076--SE}},
year = {2008},
address = {Sweden},
}
In this thesis spacetime analysis is applied to laser triangulation in an attempt to eliminate certain artifacts caused mainly by reflectance variations of the surface being measured. It is shown that spacetime analysis does eliminate these artifacts almost completely; it is also shown that, thanks to the spacetime analysis, the shape of the laser beam used is no longer critical, and that in some cases the laser could probably even be exchanged for a non-coherent light source. Furthermore, experiments of running the derived algorithm on a GPU (Graphics Processing Unit) are conducted with very promising results.
The thesis starts by deriving the theory needed for doing spacetime analysis in a laser triangulation setup, taking perspective distortions into account; then several experiments evaluating the method are conducted.
@mastersthesis{diva2:17262,
author = {Benderius, Björn},
title = {{Laser Triangulation Using Spacetime Analysis}},
school = {Linköping University},
type = {{LITH-ISY-EX--07/4047--SE}},
year = {2007},
address = {Sweden},
}
In this thesis, two real-time stereo methods have been implemented and evaluated. The first one is based on blockmatching and the second one is based on local phase. The goal was to be able to run the algorithms in real time and to examine which one performs best. The blockmatching method performed better than the phase based method, both in speed and accuracy. SIMD operations (Single Instruction Multiple Data) have been used in the processor, giving a speed boost by a factor of two.
@mastersthesis{diva2:16992,
author = {Arvidsson, Lars},
title = {{Stereoseende i realtid}},
school = {Linköping University},
type = {{LITH-ISY-EX--07/3944--SE}},
year = {2007},
address = {Sweden},
}
Today, tool center point calibration is mostly done by a manual procedure. The method is very time consuming and the result may vary depending on how skilled the operators are.
This thesis proposes a new automated iterative method for tool center point calibration of industrial robots, by making use of computer vision and image processing techniques. The new method has several advantages over the manual calibration method. Experimental verifications have shown that the proposed method is much faster, still delivering a comparable or even better accuracy. The setup of the proposed method is very easy, only one USB camera connected to a laptop computer is needed and no contact with the robot tool is necessary during the calibration procedure.
The method can be split into three different parts. Initially, the transformation between the robot wrist and the tool is determined by solving a closed loop of homogeneous transformations. Second, an image segmentation procedure is described for finding point correspondences on a rotation symmetric robot tool. The image segmentation part is necessary for performing a measurement with six degrees of freedom of the camera-to-tool transformation. The last part of the proposed method is an iterative procedure which automates an ordinary four point tool center point calibration algorithm. The iterative procedure ensures that the accuracy of the tool center point calibration only depends on the accuracy of the camera when registering a movement between two positions.
@mastersthesis{diva2:23964,
author = {Hallenberg, Johan},
title = {{Robot Tool Center Point Calibration using Computer Vision}},
school = {Linköping University},
type = {{LiTH-ISY-EX-- 07/3943--SE}},
year = {2007},
address = {Sweden},
}
A common problem when using background models to segment moving objects from video sequences is that objects' cast shadows usually differ significantly from the background and therefore get detected as foreground. This causes several problems when extracting and labeling objects, such as object shape distortion and several objects merging together. The purpose of this thesis is to explore various possibilities to handle this problem.
Three methods for statistical background modeling are reviewed. All methods work on a per pixel basis, the first is based on approximating the median, the next on using Gaussian mixture models, and the last one is based on channel representation. It is concluded that all methods detect cast shadows as foreground.
A study of existing methods to handle cast shadows has been carried out in order to gain knowledge on the subject and get ideas. A common approach is to transform the RGB-color representation into a representation that separates color into intensity and chromatic components in order to determine whether or not newly sampled pixel-values are related to the background. The color spaces HSV, IHSL, CIELAB, YCbCr, and a color model proposed in the literature (Horprasert et al.) are discussed and compared for the purpose of shadow detection. It is concluded that Horprasert's color model is the most suitable for this purpose.
The thesis ends with a proposal of a method to combine background modeling using Gaussian mixture models with shadow detection using Horprasert's color model. It is concluded that, while not perfect, such a combination can be very helpful in segmenting objects and detecting their cast shadow.
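A much simplified sketch of the brightness/chromaticity decomposition in Horprasert et al.'s colour model is given below; the per-channel normalisation and all thresholds are assumptions, and the full statistical normalisation of the original paper is omitted.

import numpy as np

def classify_pixel(I, E, sigma, cd_thresh=3.0, a_lo=0.5, a_hi=1.1):
    I, E, s = (np.asarray(v, float) for v in (I, E, sigma))
    # Brightness distortion: the scalar that best scales the expected
    # background colour E towards the observation I (least squares).
    alpha = np.sum(I * E / s**2) / np.sum((E / s)**2)
    # Chromaticity distortion: residual orthogonal to the brightness axis.
    cd = np.linalg.norm((I - alpha * E) / s)
    if cd > cd_thresh or alpha < a_lo:
        return 'foreground'
    if alpha < 1.0:
        return 'shadow'      # same chromaticity as the background, but darker
    if alpha > a_hi:
        return 'highlight'
    return 'background'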
@mastersthesis{diva2:23393,
author = {Wood, John},
title = {{Statistical Background Models with Shadow Detection for Video Based Tracking}},
school = {Linköping University},
type = {{LITH-ISY-EX--07/3921--SE}},
year = {2007},
address = {Sweden},
}
The objective of this thesis is to investigate if it is possible to use stereo vision to find and track the players and the ball during a football game.
The thesis shows that it is possible to detect all players that are not too occluded by another player. Situations where a player is occluded by another player are handled by tracking the players from frame to frame.
The ball is also detected in most frames by looking for ball-like features. As with the players, the ball is tracked from frame to frame, so that when the ball is occluded its position is estimated by the tracker.
@mastersthesis{diva2:23152,
author = {Borg, Johan},
title = {{Detecting and Tracking Players in Football Using Stereo Vision}},
school = {Linköping University},
type = {{LiTH-ISY-EX--07/3535--SE}},
year = {2007},
address = {Sweden},
}
The increased usage of infrared sensors by pilots has created a growing demand for simulated environments based on infrared radiation. This has led to an increased need for Saab to refine their existing model for simulating real-time infrared imagery, which led to this thesis being carried out. Saab develops the Gripen aircraft, and they provide training simulators where pilots can train in a realistic environment. The new model is required to be based on the real-world behavior of infrared radiation and, furthermore, unlike Saab's existing model, to have dynamically changeable attributes.
This thesis seeks to develop a simulation model compliant with the requirements presented by Saab, and to develop the implementation of a test environment demonstrating the features and capabilities of the proposed model. All through the development of the model, the pilot training value has been kept in mind.
The first part of the thesis consists of a literature study to build a theoretical base for the rest of the work. This is followed by the development of the simulation model itself and a subsequent implementation thereof. The simulation model and the test implementation are evaluated as the final step conducted within the framework of this thesis.
The main conclusions of this thesis are, first of all, that the proposed simulation model does in fact have its foundation in physics. It is further concluded that certain attributes of the model, such as time of day, are dynamically changeable as requested. Furthermore, the test implementation is considered to have been feasibly integrated with the current simulation environment.
A plan concluding how to proceed has also been developed. The plan suggests future work with the proposed simulation model, since the evaluation shows that it performs well in comparison to the existing model as well as other products on the market.
@mastersthesis{diva2:22896,
author = {Dehlin, Jonas and Löf, Joakim},
title = {{Dynamic Infrared Simulation:
A Feasibility Study of a Physically Based Infrared Simulation Model}},
school = {Linköping University},
type = {{LITH-ISY-EX--06/3815--SE}},
year = {2006},
address = {Sweden},
}
Fluoroscopy is the term for continuous X-ray imaging of a patient. Since the patient, and also the physician, is then exposed to continuous X-ray radiation, the radiation dose must be kept low, which leads to noisy images. It is therefore desirable to improve the images through image processing. The image enhancement must, however, be performed in real time, so conventional methods cannot be used.
This thesis investigates how orthogonal so-called derivative operators can be used to improve the readability of fluoroscopy images by means of noise suppression and edge enhancement. Derivative operators are separable, which makes them extremely cheap to compute and easy to insert into a scale pyramid. The scale pyramid makes it possible to process structures and details of different sizes separately, while the downsampling mechanism ensures that this decomposition does not noticeably increase the computational burden. In the complete solution, structure/noise separation is also introduced in order to prevent amplification of, and to suppress contributions from, the frequency bands where a pixel is dominated by noise.
The results show that noise can indeed be suppressed while edges and lines are well preserved, or enhanced if desired. The oriented filtering does, however, easily give rise to worm-like structures in the noise, but this can be avoided with proper parameter settings for the structure/noise separation. The balance between oriented and non-oriented filtering is likewise controllable via a parameter that can be optimized with respect to the needs and preferences of each application.
@mastersthesis{diva2:21733,
author = {Brolund, Hans},
title = {{Förbättring av fluoroskopibilder}},
school = {Linköping University},
type = {{LITH-ISY-EX-06/3823-SE}},
year = {2006},
address = {Sweden},
}
The objective of this master thesis was to study the performance of an active triangulation system for 3-D imaging in underwater applications. Structured light from a 20 mW laser and a conventional video camera was used to collect data for generation of 3-D images. Different techniques to locate the laser line and transform it into spatial coordinates were developed and evaluated. A field trial and a laboratory trial were performed.
From the trials we can conclude that the distance resolution is much higher than the lateral and longitudinal resolution. The lateral resolution can be improved either by using a high frame rate camera or simply by using a low scanning speed. It is possible to obtain a range resolution of less than a millimeter. The maximum range of vision was 5 meters under water measured on a white target, and 3 meters for a black target, in clear sea water. These results are however dependent on environmental and system parameters such as laser power, laser beam divergence and water turbidity. A higher laser power would for example increase the maximum range.
@mastersthesis{diva2:21659,
author = {Norström, Christer},
title = {{Underwater 3-D imaging with laser triangulation}},
school = {Linköping University},
type = {{LiTH-ISY-EX--06/3851--SE}},
year = {2006},
address = {Sweden},
}
To improve the control of a steel casting process, ABB has developed an Electro Magnetic Brake (EMBR). This product is designed to improve steel quality, i.e. reduce non-metallic inclusions and blisters as well as the risk of surface cracks. There is a demand for increased steel quality, and simulations and experiments play an important role in optimizing the steel casting. An advanced CFD simulation model has been created to carry out this task.
The validation of the simulation model is performed on a water model that has been built for this purpose. This water model also makes experiments possible. One step in validating the simulation model is to measure the velocity and motion pattern of the seeding particles and the air bubbles in the water model, to see whether they correspond to the simulation results.
Since the water is transparent, seeding particles have been added to the liquid in order to observe the motion of the water. They have the same density as water, hence the particles will follow the flow accurately. The motions of the air bubbles that are added to the water model also need to be observed, since they influence the flow pattern.
An algorithm - ”Transparent motions” - is thoroughly inspected and implemented. ”Transparent motions” was originally designed to post process x-ray images. However in this thesis, it is investigated whether the algorithm might be applicable to the water model and the image sequences containing seeding particles and air bubbles that are going to be used for motion estimation.
The results are satisfying for image sequences containing particles only; with a camera with a higher sampling rate, these results would improve further. For image sequences with both bubbles and particles, no results have been achieved.
@mastersthesis{diva2:21306,
author = {Gustafsson, Gabriella},
title = {{Multiphase Motion Estimation in a Two Phase Flow}},
school = {Linköping University},
type = {{LITH-ISY-EX--05/3723--SE}},
year = {2005},
address = {Sweden},
}
This thesis describes and evaluates a number of approaches and algorithms for non-uniformity correction (NUC) and suppression of fixed pattern noise in an image sequence. The main task for this thesis work was to create a general NUC for infrared focal plane arrays. To create a radiometrically correct NUC, reference based methods using polynomial approximation are used instead of the more common scene based methods, which create a cosmetic NUC.
The pixels that cannot be adjusted to give a correct value for the incoming radiation are defined as dead. Four separate methods of identifying dead pixels are used to find these pixels. Both the scene sequence and calibration data are used in these identification methods.
The algorithms and methods have all been tested by using real image sequences. A graphical user interface using the presented algorithms has been created in Matlab to simplify the correction of image sequences. An implementation to convert the corrected values from the images to radiance and temperature is also performed.
@mastersthesis{diva2:21133,
author = {Isoz, Wilhelm},
title = {{Calibration of Multispectral Sensors}},
school = {Linköping University},
type = {{LiTH-ISY-EX--05/3651--SE}},
year = {2005},
address = {Sweden},
}
This thesis aims to investigate the usefulness of the method Independent Component Analysis (ICA) for noise reduction of images taken by infrared cameras. Special focus lies on reducing additive noise. The noise is divided into two parts: Gaussian noise and the sensor-specific fixed pattern noise. To reduce the Gaussian noise, a popular ICA-based method called sparse code shrinkage is used. A new method, also based on ICA, is developed to reduce the pattern noise. In the new method, an analysis of image data is performed for each sensor in order to manually identify typical pattern-noise components. These components are then used to reduce the pattern noise in images taken by that sensor. It is shown that the methods give good results on infrared images. The algorithms are tested on both synthetic and real images, and the results are presented and compared with other algorithms.
@mastersthesis{diva2:20831,
author = {Björling, Robin},
title = {{Denoising of Infrared Images Using Independent Component Analysis}},
school = {Linköping University},
type = {{LiTH-ISY-EX--05/3726--SE}},
year = {2005},
address = {Sweden},
}
This master thesis investigates distance estimation using image processing and stereo vision for a known camera setup.
Today, a large number of computational methods exist for obtaining the distance to objects, but the performance of these methods has hardly been measured. This work mainly looks at different block-based methods for distance estimation and examines the possibilities and limitations when established knowledge in image processing and stereo vision is used for distance estimation. The work was carried out at Bofors Defence AB in Karlskoga, Sweden, with the aim of eventual use in an optical sensor system. The thesis also surveys well-established ...
The results indicate that it is difficult to determine a full range map, i.e. the distance to all visible objects, but the tested methods should still be usable point-wise to compute distances. The best method is based on computing the minimum absolute error and keeping only the most reliable values.
@mastersthesis{diva2:20786,
author = {Hedlund, Gunnar},
title = {{Närmaskbestämning från stereoseende}},
school = {Linköping University},
type = {{LiTH-ISY-EX--05/3623--SE}},
year = {2005},
address = {Sweden},
}
This report develops a method for probabilistic conceptual sensor modeling. The idea is to generate probabilities for detection, recognition and identification based on a few simple factors. The focus lies on FLIR sensors and thermal radiation, even if other wavelength bands are also discussed. The model can be used as a whole, or one or several parts can be used to create a simpler model. The core of the model is based on the Johnson criteria, which uses resolution as the input parameter. Some extensions that model other factors are also implemented. Finally, there is a short discussion of the possibility to use this model for sensors other than FLIR.
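For orientation, Johnson-criteria style probability curves are often written as a target transfer probability function of the number of resolvable cycles across the target; the sketch below uses commonly quoted N50 values, which need not match the parameterisation chosen in the thesis.

N50 = {'detection': 1.0, 'recognition': 4.0, 'identification': 6.4}

def probability(task, cycles_on_target):
    # Target transfer probability function driven by resolvable cycles.
    n = cycles_on_target / N50[task]
    e = 2.7 + 0.7 * n
    return n**e / (1.0 + n**e)

# Eight resolvable cycles: high recognition probability, lower identification probability.
print(probability('recognition', 8.0), probability('identification', 8.0))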
@mastersthesis{diva2:20633,
author = {Sonesson, Mattias},
title = {{A Probabilistic Approach to Conceptual Sensor Modeling}},
school = {Linköping University},
type = {{LITH-ISY-EX-3428-2004}},
year = {2005},
address = {Sweden},
}
The purpose of this master thesis, performed at FOI, was to evaluate a range gated underwater camera for the application of identifying bottom objects. The master thesis was supported by FMV within the framework of “arbetsorder Systemstöd minjakt (Jan Andersson, KC Vapen)”. The central part has been field trials, which have been performed in both turbid and clear water. Conclusions about the performance of the camera system have been drawn, based on resolution and contrast measurements during the field trials. Laboratory testing has also been done to measure system-specific parameters, such as the effective gate profile and camera gate distances.
The field trials show that images can be acquired at significantly longer distances with the tested gated camera, compared to a conventional video camera. The distance where the target can be detected is increased by a factor of 2. For images suitable for mine identification, the increase is about 1.3. However, studies of the performance of other range gated systems show that the increase in range for mine identification can be about 1.6. Gated viewing has also been compared to other technical solutions for underwater imaging.
@mastersthesis{diva2:20570,
author = {Andersson, Adam},
title = {{Range Gated Viewing with Underwater Camera}},
school = {Linköping University},
type = {{LITH-ISY-EX--05/3718--SE}},
year = {2005},
address = {Sweden},
}
Just how far is it possible to make learning of new parts for recognition and robot picking autonomous? This thesis initially gives the prerequisites for the steps in learning and calibration that are to be automated. Among these tasks are to select a suitable part model from numerous candidates with the help of a new part segmenter, as well as computing the spatial extent of this part, facilitating robotic collision handling. Other tasks are to analyze the part model in order to highlight correct and suitable edge segments for increasing pattern matching certainty, and to choose appropriate acceptance levels for pattern matching. Furthermore, tasks deal with simplifying camera calibration by analyzing the calibration pattern, as well as compensating for differences in perspective at great depth variations, by calculating the centre of perspective of the image. The image processing algorithms created in order to solve the tasks are described and evaluated thoroughly. This thesis shows that simplification of steps of learning and calibration, by the help of advanced image processing, really is possible.
@mastersthesis{diva2:19024,
author = {Wernersson, Björn and Södergren, Mikael},
title = {{Automatiserad inlärning av detaljer för igenkänning och robotplockning}},
school = {Linköping University},
type = {{LiTH-ISY-EX--05/3755--SE}},
year = {2005},
address = {Sweden},
}
This report addresses the problem of software correction of spatially variant blur in digital images. The problem arises when the camera optics contains flaws, when the scene contains multiple moving objects with different relative motion, or when the camera itself is, e.g., rotated. Compensation through deconvolution is impossible due to the shift variance of the PSF, hence alternative methods are required. A number of methods have been published, and this report evaluates two of them.
@mastersthesis{diva2:20290,
author = {Andersson, Mathias},
title = {{Image processing algorithms for compensation of spatially variant blur}},
school = {Linköping University},
type = {{LITH-ISY-EX--05/3633--SE}},
year = {2005},
address = {Sweden},
}
This thesis describes new methods for automatic crack detection in pavements. Cracks in pavements can be used as an early indication for the need of reparation.
Automatic crack detection is preferable to manual inventory: the repeatability can be better, the inventory can be done at a higher speed, and it can be done without interrupting the traffic.
The automatic and semi-automatic crack detection systems that exist today use Image Analysis methods. There are today powerful methods available in the area of Computer Vision. These methods work in higher dimensions with greater complexity and generate measures of local signal properties, while Image Analysis methods for crack detection use morphological operations on binary images.
Methods for digitizing video data on VHS cassettes and stitching images from nearby frames have been developed.
Four methods for crack detection have been evaluated, and two of them have been used to form a crack detection and classification program implemented in the calculation program Matlab.
One image set was used during the implementation and another image set was used for validation. The crack detection system performed correct detections in 99.2 percent of the cases when analysing the images used during implementation. The result of the crack detection on the validation data was not very good. When the program is used on data from pavements other than the one used during implementation, information about the surface texture is required to calibrate the crack detection.
@mastersthesis{diva2:20160,
author = {Håkansson, Staffan},
title = {{Detektering av sprickor i vägytor med hjälp av Datorseende}},
school = {Linköping University},
type = {{LITH-ISY-EX--05/3699--SE}},
year = {2005},
address = {Sweden},
}
Contemporary algorithms employed for reconstruction of 3D volumes from helical cone beam projections are so-called non-exact algorithms. This means that the reconstructed volumes will contain artifacts irrespective of the detector resolution and the number of projection angles employed in the process.
It has been proposed that these artifacts can be suppressed using an iterative scheme which comprises computation of projections from the already reconstructed volume as well as the non-exact reconstruction itself.
The purpose of the present work is to examine whether the iterative scheme can be applied to the non-exact reconstruction method PI-original in order to improve the reconstruction result. An important part of this implementation is a careful design of the projection operator, as a poorly designed projection operator may result in aliasing and/or other artifacts in the reconstruction result. Since the projection data is truncated, special care must be taken along the boundaries of the detector. Three different ways of handling this interpolation problem are proposed and examined.
The results show that artifacts caused by the PI-original method can indeed be reduced by the iterative scheme. However, each iteration requires at least three times more processing time than the initial reconstruction, which may call for certain compromises, smartness and/or parallelization in the innermost loops. Furthermore, at higher cone angles certain types of artifacts seem to grow by each iteration instead of being suppressed.
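The iterative scheme itself can be summarised in a few lines; in the sketch below, project and reconstruct stand for the forward projection operator and the non-exact (e.g. PI-original) reconstruction, both of which are assumed to be given, so the snippet only illustrates the structure of the loop.

def iterative_enhancement(projections, project, reconstruct, n_iter=3):
    volume = reconstruct(projections)             # initial non-exact reconstruction
    for _ in range(n_iter):
        residual = projections - project(volume)  # compare with the measured data
        volume = volume + reconstruct(residual)   # correct with the reconstructed residual
    return volume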
@mastersthesis{diva2:19912,
author = {Sunnegårdh, Johan},
title = {{Iterative Enhancement of Non-Exact Reconstruction in Cone Beam CT}},
school = {Linköping University},
type = {{LITH-ISY-EX--04/3646--SE}},
year = {2004},
address = {Sweden},
}
This report describes and evaluates a number of algorithms for multi-sensor data fusion of radar and IR/TV data at the raw data level. Raw data fusion means that the fusion takes place before attribute or object extraction. Attribute extraction may cause information to be lost that could otherwise improve the fusion. If the fusion is performed at the raw data level, more information is available, which could lead to improved attribute extraction in a later step. Two approaches are presented. The first method projects the radar image into the IR view and vice versa; the fusion is then performed on the pairs of images with the same dimensions. The second method fuses the two original images into a volume, which is spanned by the three dimensions represented in the original images. The method is also extended by exploiting stereo vision. The results show that exploiting stereo vision can be worthwhile, since the extra information facilitates the fusion and gives a more general solution to the problem.
@mastersthesis{diva2:19523,
author = {Schultz, Johan},
title = {{Sensordatafusion av IR- och radarbilder}},
school = {Linköping University},
type = {{}},
year = {2004},
address = {Sweden},
}
By analyzing ISAR images, the characteristics of military platforms with respect to radar visibility can be evaluated. The method currently used to calculate the ISAR images, which is based on the Discrete-Time Fourier Transform (DTFT), requires a large computational effort. This thesis investigates the possibility of replacing the DTFT with the Fast Fourier Transform (FFT). Such a replacement is not trivial, since the DTFT is able to compute a contribution anywhere along the spatial axis while the FFT delivers output data at a fixed sampling, which requires subsequent interpolation. The interpolation leads to a difference in the ISAR image compared to the ISAR image obtained by the DTFT. On the other hand, the FFT is much faster. In this quality-and-time trade-off, the objective is to minimize the error while keeping high computational efficiency.
The FFT approach is evaluated by studying execution time and image error when generating ISAR images for an aircraft model in a controlled environment. The FFT method shows good results. The execution speed is increased significantly without any visible differences in the ISAR images. The speed-up factor depends on several parameters: image size, degree of zero-padding when calculating the FFT, and the number of frequencies in the input data.
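To make the trade-off concrete, the difference between a direct DTFT evaluation and a zero-padded FFT followed by interpolation can be sketched as follows; the padding factor and the use of simple linear interpolation are assumptions made for illustration.

import numpy as np

def dtft(x, omega):
    # Direct evaluation at arbitrary angular frequencies omega (in [0, 2*pi)).
    n = np.arange(len(x))
    return np.array([np.sum(x * np.exp(-1j * w * n)) for w in omega])

def fft_interpolated(x, omega, pad_factor=8):
    # Zero-padded FFT gives a dense but fixed frequency grid ...
    N = pad_factor * len(x)
    X = np.fft.fft(x, n=N)
    grid = 2 * np.pi * np.arange(N) / N
    # ... which must then be interpolated at the wanted frequencies.
    return np.interp(omega, grid, X.real) + 1j * np.interp(omega, grid, X.imag)

Increasing the padding factor reduces the interpolation error at the cost of a larger FFT, which is exactly the quality-and-time trade-off described above.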
@mastersthesis{diva2:19402,
author = {Dahlbäck, Niklas},
title = {{Implementation of a fast method for reconstruction of ISAR images}},
school = {Linköping University},
type = {{}},
year = {2003},
address = {Sweden},
}
A complete prototype system for measuring vehicle lateral position has been set up during the course of this master's thesis project. In the development of the software, images acquired from a backward-looking video camera mounted on the roof of the vehicle were used.
The problem of using computer vision to measure lateral position can be divided into road marking detection and lateral position extraction. Since the strongest characteristic of a road marking image is the edges of the road markings, the road marking detection step is based on edge detection. For the detection of the straight edge lines, a Hough based method was chosen. Due to peak spreading, the difficulty of detecting the correct peak in Hough space was encountered. A flexible Hough peak detection algorithm was developed, based on an adaptive window that takes peak spreading into account. The road marking candidate found by the system is verified before the lateral position data is generated. A good performance of the road marking tracking algorithm was obtained by exploiting temporal correlation to update a search region within the image. A camera calibration made the extraction of real-world lateral position information and yaw angle data possible.
This vision-based method proved to be very accurate. The standard deviation of the error in the position detection is 0.012 m within an operating range of ±2 m from the image centre. During continuous road markings the rate of valid data is on average 96 %, whereas it drops to around 56 % for sections with intermittent road markings. The system performs well during lane change manoeuvres, which is an indication that the system tracks the correct road marking. This prototype system is a robust and automatic measurement system, which will benefit VTI in its many driving behaviour research programs.
@mastersthesis{diva2:19311,
author = {Ågren, Elisabeth},
title = {{Lateral Position Detection Using a Vehicle-Mounted Camera}},
school = {Linköping University},
type = {{}},
year = {2003},
address = {Sweden},
}
This thesis describes and evaluates a number of algorithms for reducing fixed pattern noise in image sequences. Fixed pattern noise is the dominant noise component for many infrared detector systems, perceived as a superimposed pattern that is approximately constant for all image frames.
Primarily, methods based on estimation of the movement between individual image frames are studied. Using scene-matching techniques, global motion between frames can be successfully registered with sub-pixel accuracy. This allows each scene pixel to be traced along a path of individual detector elements. Assuming a static scene, differences in pixel intensities are caused by fixed pattern noise that can be estimated and removed.
The algorithms have been tested by using real image data from existing infrared imaging systems with good results. The tests include both a two-dimensional focal plane array detector and a linear scanning one-dimensional detector, in different scene conditions.
@mastersthesis{diva2:19078,
author = {Torle, Petter},
title = {{Scene-based correction of image sensor deficiencies}},
school = {Linköping University},
type = {{}},
year = {2003},
address = {Sweden},
}
This master's thesis develops an algorithm for tracking cars that is robust enough to handle turning cars. It is implemented in the image processing environment Image Processing Application Programming Interface (IPAPI) for use within the WITAS project.
Firstly, algorithms comparable with the one currently used in the WITAS project are studied. The focus is on how rotation, which originates from the turning of the cars, affects tracking performance. The algorithms studied all perform an exhaustive search over a region close to the last known position of the object being tracked to find a match. After this, an iterative algorithm, based on the idea that a car can only rotate, translate and change scale, is introduced. The algorithm iteratively estimates the parameters describing this rotation, translation and change of scale. The iterative process needs an initial parameter estimate that is accurate enough for the algorithm to converge. The developed algorithm is based on an earlier publication on the subject; however, the mathematical description and derivation are taken one step further than in that publication.
The iterative algorithm performs well under the assumption that the data used fulfills some basic criteria. These demands comprise the placement of the camera, the template size, and how much the parameters may vary between two observations. The iterative algorithm is also potentially faster than exhaustive search methods, because few iterations are needed when the parameters change slowly. Better initial parameters should improve the stability and speed of convergence. Other suggestions that could give better performance are also discussed, e.g., methods to better extract the target from the surroundings.
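If point correspondences between the template and the new frame are available, a 4-DOF similarity transform (rotation, uniform scale and translation) can also be estimated robustly with OpenCV, as in the sketch below; this is a generic alternative shown for illustration, not the intensity-based iterative scheme developed in the thesis.

import cv2
import numpy as np

def similarity_from_points(src_pts, dst_pts):
    src = np.asarray(src_pts, np.float32).reshape(-1, 1, 2)
    dst = np.asarray(dst_pts, np.float32).reshape(-1, 1, 2)
    # 2x3 matrix constrained to rotation, uniform scale and translation.
    M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    scale = np.hypot(M[0, 0], M[1, 0])
    angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
    return scale, angle, (M[0, 2], M[1, 2])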
@mastersthesis{diva2:19030,
author = {Öberg, Per},
title = {{Tracking by Image Processing in a Real Time System}},
school = {Linköping University},
type = {{}},
year = {2003},
address = {Sweden},
}
A Transaction Reproduction System (ARTSY) is a distributed system that enables secure transactions and reproductions of digital content over an insecure network. A field of application is reproductions of visual arts: A print workshop could for example use ARTSY to print a digital image that is located at a remote museum. The purpose of this master thesis project was to propose a specification for ARTSY and to show that it is technically feasible to implement it.
An analysis of the security threats in the ARTSY context was performed and a security model was developed. The security model was approved by a leading computer security expert. The security mechanisms that were chosen for the model were: Asymmetric cryptology, digital signatures, symmetric cryptology and a public key registry. A Software Requirements Specification was developed. It contains extra directives for image reproduction systems but it is possible to use it for an arbitrary type of reproduction system. A prototype of ARTSY was implemented using the Java programming language. The prototype uses XML to manage information and Java RMI to enable remote communication between its components. It was built as a platform independent system and it has been tested and proven to be operational on the Sun Solaris platform as well as the Win32 platform.
@mastersthesis{diva2:18935,
author = {Björk, Mårten and Max, Sofia},
title = {{ARTSY:
A Reproduction Transaction System}},
school = {Linköping University},
type = {{}},
year = {2003},
address = {Sweden},
}
This Master’s thesis studies the possibility of using image processing as a tool to facilitate vine management, in particular shoot counting and assessment of the grapevine canopy. Both are areas where manual inspection is done today. The thesis presents methods of capturing images and segmenting different parts of a vine. It also presents and evaluates different approaches on how shoot counting can be done. Within canopy assessment, the emphasis is on methods to estimate canopy density. Other possible assessment areas are also discussed, such as canopy colour and measurement of canopy gaps and fruit exposure. An example of a vine assessment system is given.
@mastersthesis{diva2:18665,
author = {Bjurström, Håkan and Svensson, Jon},
title = {{Assessment of Grapevine Vigour Using Image Processing}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
This is a thesis written for a master's degree at the Computer Vision Laboratory, Linköping University. An abstract outer product is defined and used as a bridge to reach 2nd and 4th order tensors. Some applications of these in geometric analysis of range data are discussed and illustrated. In idealized setups, simple geometric objects, like spheres or polygons, are successfully detected. Finally, the generalization to nth order tensors for storing and analysing geometric information is discussed.
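As a minimal sketch of the outer-product idea (not code from the thesis), a 2nd-order tensor can be accumulated from outer products of surface normals estimated from range data; its eigenstructure then distinguishes, for instance, a planar neighbourhood from an isotropic one. The random normals below are placeholders for measured data.

    import numpy as np

    # placeholder surface normals; for a planar patch they would cluster around
    # the plane normal, so the largest eigenvector of T recovers the orientation
    normals = np.random.randn(500, 3)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)

    # 2nd-order tensor as the mean of outer products n n^T
    T = sum(np.outer(n, n) for n in normals) / len(normals)

    eigvals, eigvecs = np.linalg.eigh(T)   # eigenvalue spread indicates the local geometry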
@mastersthesis{diva2:18558,
author = {Eidehall, Andreas},
title = {{Tensor representation of 3D structures}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
The purpose of this thesis is to investigate the applicability of a certain model-based classification algorithm. The algorithm is centered around a flexible wireframe prototype that can instantiate a number of different vehicle classes, such as a hatchback, a pickup or a bus, to mention a few. The parameters of the model are fitted using Newton minimization of the errors between model line segments and observed line segments. Furthermore, a number of methods for object detection based on motion are described and evaluated. Results from both experimental and real-world data are presented.
@mastersthesis{diva2:18561,
author = {Böckert, Andreas},
title = {{Vehicle detection and classification in video sequences}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
The purpose of this master's thesis is to evaluate whether it is feasible to use the panchromatic band of Landsat 7 in order to improve the spatial resolution of colour images. The images are to be used as texture in visual databases for flight simulators and for this reason it is important that the fusion preserves natural colours.
A number of methods for fusing panchromatic and multispectral images are discussed. Four of them are implemented and evaluated. The result is that standard methods such as HSI substitution are not suitable for this purpose, since they do not preserve natural colours. However, if only the high frequencies of the panchromatic image are used, the resolution can be improved without noticeable colour distortion.
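A hedged sketch of the high-frequency idea (not necessarily one of the four methods evaluated in the thesis): inject only the high-pass part of the panchromatic band into the upsampled multispectral bands, so the low-frequency colour content is left untouched. The band layout, Gaussian cut-off and interpolation order are assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter, zoom

    def hpf_fusion(ms, pan, sigma=2.0):
        """High-pass-filter fusion.
        ms:  (rows, cols, bands) multispectral image at coarse resolution
        pan: (R, C) panchromatic band, where R and C are multiples of rows and cols"""
        sr = pan.shape[0] / ms.shape[0]
        sc = pan.shape[1] / ms.shape[1]
        ms_up = zoom(ms.astype(float), (sr, sc, 1), order=1)   # upsample the colour bands
        pan_high = pan.astype(float) - gaussian_filter(pan.astype(float), sigma)
        return ms_up + pan_high[..., None]                     # add only high-frequency detail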
@mastersthesis{diva2:17912,
author = {Molin, Sara},
title = {{Förbättring av upplösningen i Landsat 7-bilder med hjälp av bildfusion}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
Tracking solar collectors, heliostats, are certainly not a new idea; they have been explored for at least two decades. Projects on this subject have resulted in constructions that are more or less realistic from a commercial point of view. Far too often the technical goals have had higher priority than the economic ones, with the result that few constructions can compete with conventional, fixed solar collectors. Economic issues have been given high priority in this project, without lowering the demands on reliability. The system has been given the following mechanical and electronic properties: one-axis movement, a fixed heat-carrying fluid system, microcomputer-controlled movement and automatic protection from overheating. Given the development in digital technology, with falling prices for advanced semiconductors as a consequence, the conclusion is that the prerequisites for this concept will be even better in the future. The result of this thesis is a heliostat function that increases the energy gain by up to 40% compared to a field of MaReCo collectors without this function, while the cost increases by only 13%.
@mastersthesis{diva2:17448,
author = {Svensson, Mikael},
title = {{Utveckling av styrning till solföljande MaReCo-hybrid i Hammarby Sjöstad}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
This thesis presents a 3D semi-automatic segmentation technique for extracting the lumen surface of the Carotid arteries including the bifurcation from 3D and 4D ultrasound examinations.
Ultrasound images are inherently noisy. Therefore, to aid the inspection of the acquired data, an adaptive edge-preserving filtering technique is used to reduce the generally high noise level. The segmentation process starts with edge detection using a recursive and separable 3D Monga-Deriche-Canny operator. To reduce the computation time needed for the segmentation, a seeded region-growing technique is used to create an initial model of the artery. The final segmentation is based on the inflatable balloon model, which deforms the initial model to fit the ultrasound data. The balloon model is implemented with the finite element method.
The segmentation technique produces 3D models that are intended as pre-planning tools for surgeons. The results from a healthy person are satisfactory and the results from a patient with stenosis seem rather promising. A novel 4D model of wall motion of the Carotid vessels has also been obtained. From this model, 3D compliance measures can easily be obtained.
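For illustration, a minimal seeded region-growing step of the kind used to build the initial model might look like the sketch below; the simple intensity-difference criterion and 6-connectivity are simplifying assumptions, and the balloon-model refinement is not shown.

    import numpy as np
    from collections import deque

    def region_grow(volume, seed, tol):
        """Seeded region growing in a 3D volume: voxels are added while their
        intensity stays within `tol` of the seed intensity."""
        grown = np.zeros(volume.shape, dtype=bool)
        ref = volume[seed]
        queue = deque([seed])
        grown[seed] = True
        offsets = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < volume.shape[i] for i in range(3))
                if inside and not grown[n] and abs(volume[n] - ref) <= tol:
                    grown[n] = True
                    queue.append(n)
        return grown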
@mastersthesis{diva2:17818,
author = {Mattsson, Per and Eriksson, Andreas},
title = {{Segmentation of Carotid Arteries from 3D and 4D Ultrasound Images}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
Face detection and pose estimation are two widely studied problems - mainly because of their use as subcomponents in important applications, e.g. face recognition. In this thesis I investigate a new approach to the general problem of object detection and pose estimation and apply it to faces. Face detection can be considered a special case of this general problem, but is complicated by the fact that faces are non-rigid objects. The basis of the new approach is the use of scale and orientation invariant feature structures - feature triplets - extracted from the image, as well as a biologically inspired associative structure which maps from feature triplets to desired responses (position, pose, etc.). The feature triplets are constructed from curvature features in the image and coded in a way to represent distances between major facial features (eyes, nose and mouth). The final system has been evaluated on different sets of face images.
@mastersthesis{diva2:17324,
author = {Isaksson, Marcus},
title = {{Face Detection and Pose Estimation using Triplet Invariants}},
school = {Linköping University},
type = {{}},
year = {2002},
address = {Sweden},
}
The aim of this master's thesis is to determine the tree class from an image of a leaf using a computer vision classification system. We compare different descriptors that describe different features of the leaves. We also look at different classification models and combine them with the descriptors to build a system that can classify the different tree classes.
@mastersthesis{diva2:303038,
author = {Söderkvist, Oskar},
title = {{Computer Vision Classification of Leaves from Swedish Trees}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 3132}},
year = {2001},
address = {Sweden},
}
This Master's Thesis discusses the different trade-offs a programmer needs to consider when constructing image processing systems. First, an overview of the different alternatives available is given followed by a focus on systems based on general hardware. General, in this case, means mass-market with a low price-performance-ratio. The software environment is focused on UNIX, sometimes restricted to Linux, together with C, C++ and ANSI-standardized APIs.
@mastersthesis{diva2:303037,
author = {Nordlöv, Per},
title = {{Implementation Aspects of Image Processing}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 3088}},
year = {2001},
address = {Sweden},
}
The purpose of this master’s thesis was to study the possibility of using computer vision methods to detect and classify objects in the front passenger seat of a car. This work presents different approaches to solving this problem and evaluates the usefulness of each technique. The classification information should later be used to modulate the speed and force of the airbag, in order to provide each occupant with optimal protection and safety.
This work shows that computer vision has great potential to provide data that may be used to perform reliable occupant classification. The future choice of method depends on many factors, for example cost and the requirements placed on the system by legislation and car manufacturers. Further evaluation and testing of the methods in this thesis, of other methods, of the ABE approach, and of post-processing of the results should also be carried out before a reliable classification algorithm can be written.
@mastersthesis{diva2:303034,
author = {Klomark, Marcus},
title = {{Occupant Detection using Computer Vision}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 3026}},
year = {2000},
address = {Sweden},
}
We explore the use of colour for interpretation of unstructured off-road scenes. The aim is to extract driveable areas for use in an autonomous off-road vehicle in real-time. The terrain is an unstructured tropical jungle area with vegetation, water and red mud roads.
We show that hue is both robust to changing lighting conditions and an important feature for correctly interpreting this type of scene. We believe that our method can also be deployed in other types of terrain, with minor changes, as long as the terrain is coloured and well saturated.
Only 2D information is processed at the moment, but we aim to extend the method to also handle 3D information, using stereo vision or motion.
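A minimal sketch of hue-based pixel selection with OpenCV (the exact features and thresholds of the thesis are not reproduced; the file name, hue band and saturation threshold below are illustrative assumptions for a reddish road class):

    import cv2

    bgr = cv2.imread("scene.png")                 # hypothetical off-road scene
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)

    # hue is stored in [0, 179] for 8-bit images; red/brown wraps around 0,
    # and only well-saturated pixels are trusted, as argued above
    road_mask = ((h < 20) | (h > 170)) & (s > 60)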
@mastersthesis{diva2:303033,
author = {Bergquist, Urban},
title = {{Colour Vision and Hue for Autonomous Vehicle Guidance}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 2091}},
year = {1999},
address = {Sweden},
}
This thesis investigates the possibilities of using GIS (Geographic Information System) data with an airborne autonomous vehicle developed in the WITAS project. Available for the thesis are high resolution (0.16 meter sample interval) aerial photographs over Stockholm, and vector data in a common GIS format containing all roads in the Stockholm area.
A method for removing cars from aerial photographs is presented, using the filtering method normalized convolution, originally developed for filtering uncertain and incomplete data. By setting the certainty to zero over the cars, this data is disregarded in the filtering process, resulting in an image without cars. This method is further improved by choosing an anisotropic applicability function, resulting in a filtering that preserves structures oriented in certain directions.
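A hedged sketch of normalized convolution with a constant basis (normalized averaging): the image is filtered weighted by a certainty map, so pixels with certainty zero, e.g. those covered by cars, do not contribute and are instead filled in from their certain neighbours. The thesis' anisotropic applicability function would simply be a different kernel; the variable names below are illustrative.

    import numpy as np
    from scipy.ndimage import convolve

    def normalized_convolution(image, certainty, kernel):
        """Filter `image` with applicability `kernel`, weighted by `certainty`."""
        c = certainty.astype(float)
        num = convolve(image.astype(float) * c, kernel)
        den = convolve(c, kernel)
        return num / np.maximum(den, 1e-12)

    # example usage with an isotropic applicability function (hypothetical inputs):
    # kernel = np.ones((15, 15)) / 225.0
    # filled = normalized_convolution(aerial, car_mask == 0, kernel)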
The available vector data is investigated with regard to its use in a simulator for vehicle movement, and is found to be missing much of the essential information needed in such a simulator. A new data format better suited to these requirements is created, using the Extensible Markup Language (XML), which gives a human-readable data format and allows existing parsers to be used, simplifying the implementation. The result is a somewhat complex, but highly general, data format that can accurately express almost any type of road and intersection. Cars can follow arbitrary paths in the road database and move with a smooth motion suitable for use as input to image processing equipment. The simulator does not allow any dynamic behaviour such as changing speeds, starting or stopping, or interaction between cars such as overtaking or intelligent behaviour in intersections.
In the airborne vehicle, a mapping from pixels in a camera image (like the ones output from the simulator) to locations in the road database is needed. This is an inverse mapping with respect to the visualization described above. It gives a car-tracking system important information about the probable movement of cars and also makes it possible to determine whether a car breaks traffic regulations. A mapping of this kind is created using a simplified form of ray tracing known as ray casting, together with space partitioning methods that vastly improve efficiency.
All the above tasks are implemented using C++ and object-oriented methods, giving maintainable and extendable code suited to a quickly changing research area. The interface to the simulator is designed to be compatible with the existing simulation software used in the WITAS project. Visualization is done through the OpenGL graphics library, providing realistic effects such as lighting and shading.
@mastersthesis{diva2:303032,
author = {Langemark, Stefan},
title = {{GIS in a simulator environment and efficient inverse mapping of roads}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 2090}},
year = {1999},
address = {Sweden},
}
Automated storage systems often rely on the positions of the pallets being known with high precision. In this thesis, a turnable camera mounted on the robot has been used to handle the situation of only approximately known pallet positions. The robot is given the approximate location of a pallet, and its objective is to locate the pallet with a precision high enough to approach it from the correct direction and then lift it. For this, a precision of a few centimetres in each direction is needed.
A system for locating the pallet from single images, based on rotational symmetry filters, has been developed, and a simple program for controlling the robot has been implemented. These could very well be extended and improved, e.g. by considering multiple images and improving the path planning.
The main part of the thesis deals with the image processing part. Other parts of the project, apart from the controller, include implementation of servers controlling the camera and the frame grabber.
Some tests have been made, which show fairly promising results.
@mastersthesis{diva2:303029,
author = {Roll, Jakob},
title = {{A System for Visual-Based Automated Storage Robots}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 2053}},
year = {1999},
address = {Sweden},
}
Experience from earlier trials at Korsnäs AB shows that it is very difficult to predict mathematically what happens during the production of pulp in a continuous digester.
The goal of this master's thesis was to investigate the possibilities of using neural networks to facilitate process control by predicting the lignin content of the pulp three and a half hours before the wood chips in question are fully cooked.
Because the time lag between different sensor signals varies with the production rate, the problem was solved with one simple, local model per production rate. All models were minimized with respect to both the number of nodes in the hidden layer and the number of inputs, giving a final solution with four simple models built from feed-forward neural networks, each with one hidden layer containing three nodes.
The prediction of the lignin content finally showed good properties with respect to how well it follows the actual kappa number analyser.
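As a rough illustration only (the thesis predates this library and its exact inputs are not reproduced here), one of the small feed-forward models with a single hidden layer of three nodes could be set up as follows; X_train, y_train and the sensor matrix are hypothetical placeholders.

    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # one such model would be trained per production rate, predicting the
    # lignin content (kappa number) 3.5 hours ahead from selected sensor signals
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(3,), max_iter=5000, random_state=0),
    )
    # model.fit(X_train, y_train)
    # y_pred = model.predict(X_future_sensors)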
@mastersthesis{diva2:303022,
author = {Stewing, Robert},
title = {{Parameterprediktering med multipla sammansatta lokala neuronnätsbaserade modeller vid framställning av pappersmassa}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1991}},
year = {1999},
address = {Sweden},
}
In this report, three main problems are considered. The first is how to filter position data of vehicles. To do so, the vehicles have to be tracked; this is done with Kalman filters. The second problem is how to control a camera to keep a vehicle in the center of the image, under three different conditions. This is mainly solved with a Kalman filter. The last problem is how to use the color of the vehicles to make their classification more robust. Some suggestions on how this might be done are given; however, no really good method has been found.
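In the spirit of the first problem, a minimal constant-velocity Kalman filter for smoothing 2D vehicle positions is sketched below; the frame interval, noise covariances and initial state are assumptions rather than values from the report.

    import numpy as np

    dt = 0.04                                    # assumed frame interval
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)   # state transition for (x, y, vx, vy)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)    # only position is measured
    Q = 1e-2 * np.eye(4)                         # process noise (assumed)
    R = 4.0 * np.eye(2)                          # measurement noise (assumed)

    x = np.zeros(4)                              # initial state
    P = 100.0 * np.eye(4)                        # initial uncertainty

    def kalman_step(x, P, z):
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with measured position z = (x, y)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P

    # usage per frame: x, P = kalman_step(x, P, np.array([412.0, 310.0]))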
@mastersthesis{diva2:530596,
author = {Moe, Anders},
title = {{Investigations in Tracking and Colour Classification}},
school = {Linköping University},
type = {{}},
year = {1998},
address = {Sweden},
}
Chapter 2 describes the concept of canonical correlation, which is needed to understand the discussion that follows.
Chapter 3 introduces the problem to be solved.
Chapters 4, 5 and 6 discuss three different suggestions for how to approach the problem. Each chapter begins with a section of experiments as motivation for the approach, followed by some theory and mathematical manipulations to structure the ideas. The last sections contain discussions and suggestions concerning the approach.
Finally, chapter 7 contains a summary and a comparative discussion of the approaches.
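As a quick illustration of canonical correlation itself (the concept from chapter 2, not the thesis' recognition method), the following sketch finds maximally correlated projections of two hypothetical feature matrices:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # X and Y are placeholder feature matrices (samples x dimensions)
    # computed from the two signals being compared
    X = np.random.randn(200, 10)
    Y = np.random.randn(200, 8)

    cca = CCA(n_components=2)
    Xc, Yc = cca.fit_transform(X, Y)

    # the canonical correlations are the correlations between paired projections
    corr = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(2)]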
@mastersthesis{diva2:303009,
author = {Johansson, Björn},
title = {{Multidimensional signal recognition, invariant to affine transformation and time-shift, using canonical correlation}},
school = {Linköping University},
type = {{LiTH-ISY-EX-1825}},
year = {1997},
address = {Sweden},
}
Segmentation is a process that separates objects in an image. In medical images, particularly image volumes, the field of application is wide. For example, 3D visualisations of the anatomy could benefit enormously from segmentation. The aim of this thesis is to construct a segmentation tool.
The project consists of three main parts. First, a survey of the actual need for segmentation in medical image volumes was carried out. Then a unique three-step model for a segmentation tool was implemented, tested and evaluated.
The first step of the segmentation tool is a seed-growing method that uses the intensity and an orientation tensor estimate to decide which voxels are part of the object. The second step uses an active contour, a deformable “balloon”. The contour is shrunk to fit the segmented border from the first step, yielding a surface suitable for visualisation. The last step consists of letting the contour reshape according to the orientation tensor estimate.
The user evaluation establishes the usefulness of the tool. The model is flexible and well adapted to the users’ requests. For unclear objects the segmentation may fail, but the cause is mostly poor image quality. Even though much work remains to be done on the second and third parts of the tool, the results are most promising.
@mastersthesis{diva2:303019,
author = {Lundström, Claes},
title = {{Segmentation of Medical Image Volumes}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1864}},
year = {1997},
address = {Sweden},
}
In this report, the principles of man-made object detection in satellite images are investigated. An overview of terminology and of how the detection problem is usually solved today is given. A three-level system to solve the detection problem is proposed. The main branches of this system handle road and city detection, respectively. To achieve data source flexibility, the Logical Sensor notion is used to model the low-level system components. Three Logical Sensors have been implemented and tested on Landsat TM and SPOT XS scenes. These are: BDT (Background Discriminant Transformation) to construct a man-made object property field; local orientation for texture estimation and road tracking; and texture estimation using local variance and variance of local orientation. A gradient magnitude measure for road seed generation has also been tested.
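One of the listed components, texture estimation using local variance, can be sketched as follows (the window size is an assumption, and the BDT and orientation-based sensors are not shown):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def local_variance(image, size=9):
        """Variance of pixel values in a local window; man-made areas such as
        cities and roads tend to differ from natural terrain in this measure."""
        img = image.astype(float)
        mean = uniform_filter(img, size)
        mean_sq = uniform_filter(img ** 2, size)
        return mean_sq - mean ** 2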
@mastersthesis{diva2:303014,
author = {Forss\'{e}n, Per-Erik},
title = {{Detection of Man-made Objects in Satellite Images}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1852}},
year = {1997},
address = {Sweden},
}
Artificial neural networks (ANN) are a technique that has matured over the last ten years and is now found in an increasing number of applications, such as reading of written text, linear programming, control engineering, expert systems, speech recognition and many different kinds of classification problems [Zurada, 1992]. In our master's thesis we wanted to try to use ANNs in an industrial process where standard methods have not worked satisfactorily or have been difficult to apply. We found such a process in the production of pulp.
Producing pulp from wood requires a long and complicated process divided into several steps. One of these steps is the so-called cooking, where wood chips are broken down into fibres using high pressure and hot liquor. The cooking process is complex, runs for a long time (about 8 hours) and is affected by a large number of parameters, so great experience and knowledge are required to control it. Kværner Pulping Technologies in Karlstad, which among other things designs digesters, has developed a simulator of the cooking process in order to gain better insight into how the process works and consequently be able to control the cooking in a better way. The behaviour of the simulator depends on a number of so-called hidden parameters, a subset of the parameters assumed to affect the cooking process. These hidden parameters are difficult or impossible to measure and are therefore set to estimated values in the simulation. The corresponding hidden parameters in the real process, however, vary in an unknown way. They are affected partly by internal processes in the digester and partly by external causes; for example, wood chips of a different quality may be fed into the digester. This means that the simulator gives good simulations only for rather short periods, while the hidden parameters are approximately constant.
If the changes in the hidden parameters of the process could somehow be detected and transferred to the simulator, it could run "in parallel" with the cooking process. The simulator would in that case be an excellent complementary tool for the person controlling the cooking process, since he or she would get a better idea of what is happening, or has happened, in the process and thereby a better basis for control decisions. This presupposes that the simulator is good enough to capture, with sufficient precision, the global development in the digester under stationary parameter conditions.
As a first step towards this goal, we investigate in this report whether changes in the hidden parameters of the simulator can be detected using feed-forward ANNs and the resilient propagation learning algorithm.
The report is divided into 7 chapters. Chapter 2 treats the problem in more detail. Chapters 3 and 4 are of a general nature, describing the paper manufacturing process and what artificial neural networks really are. Chapter 5 describes the different proposed solutions and the results we have achieved. Conclusions and results are summarized in chapter 6. There is much more we would like to try and investigate; this future work is described in chapter 7. At the end of the report, appendices 1 and 2 contain details that we find relevant but too bulky to include in the main part of the report. Appendix 3 contains the program code produced during the work.
@mastersthesis{diva2:302994,
author = {Andersson, Thord and Karlsson, Mikael},
title = {{Neuronnätsbaserad identifiering av processparametrar vid tillverkning av pappersmassa}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1709}},
year = {1997},
address = {Sweden},
}
To find a shape in an image, a technique called snakes or active contours can be used. An active contour is a curve that moves towards the sought-for shape in a way controlled by internal forces - such as rigidity and elasticity - and an image force. The image force should attract the contour to certain features, such as edges, in the image. This is done by creating an attractor image, which defines how strongly each point in the image should attract the contour.
In this thesis the extension to contours (surfaces) in three dimensional images is studied. Methods of representation of the contour and computation of the internal forces are treated.
Also, a new way of creating the attractor image, using the orientation tensor to detect planar structure in 3D images, is studied. The new method is not generally superior to those already existing, but still has its uses in specific applications.
During the project, it turned out that the main problem with active contours in 3D images was instability due to strong internal forces overriding the influence of the attractor image. The problem was solved satisfactorily by projecting the elasticity force onto the contour’s tangent plane, which was approximated efficiently using sphere fitting.
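A hedged 2D sketch of one explicit contour update is given below (the thesis works with surfaces in 3D and uses the tangent-plane projection of the elasticity force, which is omitted here); the weights and the nearest-neighbour sampling of the attractor gradient are assumptions.

    import numpy as np

    def snake_step(pts, attractor, alpha=0.1, beta=0.05, step=1.0):
        """One explicit update of a closed 2D active contour.
        pts: (N, 2) contour points (row, col); attractor: 2D image whose high
        values should pull the contour, e.g. edge strength."""
        # internal forces from finite differences along the contour
        d2 = np.roll(pts, -1, axis=0) - 2 * pts + np.roll(pts, 1, axis=0)      # elasticity
        d4 = (np.roll(pts, -2, axis=0) - 4 * np.roll(pts, -1, axis=0) + 6 * pts
              - 4 * np.roll(pts, 1, axis=0) + np.roll(pts, 2, axis=0))          # rigidity
        # external force: gradient of the attractor image, sampled at the points
        gy, gx = np.gradient(attractor.astype(float))
        idx = np.clip(np.round(pts).astype(int), 0, np.array(attractor.shape) - 1)
        ext = np.stack([gy[idx[:, 0], idx[:, 1]], gx[idx[:, 0], idx[:, 1]]], axis=1)
        return pts + step * (alpha * d2 - beta * d4 + ext)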
@mastersthesis{diva2:302987,
author = {Ahlberg, Jörgen},
title = {{Active Contours in Three Dimensions}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1708}},
year = {1996},
address = {Sweden},
}
This Master's Thesis addresses the problem of segmenting an image sequence with respect to the motion in the sequence. As a basis for the motion estimation, 3D orientation tensors are used. The goal of the segmentation is to partition the images into regions, characterized by having a coherent motion. The motion model is affine with respect to the image coordinates. A method to estimate the parameters of the motion model from the orientation tensors in a region is presented. This method can also be generalized to a large class of motion models.
Two segmentation algorithms are presented together with a postprocessing algorithm. All these algorithms are based on the competitive algorithm, a general method for distributing points between a number of regions, without relying on arbitrary threshold values. The first segmentation algorithm segments each image independently, while the second algorithm recursively takes advantage of the previous segmentation. The postprocessing algorithm stabilizes the segmentations of a whole sequence by imposing continuity constraints.
The algorithms have been implemented and the results of applying them to a test sequence are presented. Interesting properties of the algorithms are that they are robust to the aperture problem and that they do not require a dense velocity field.
It is finally discussed how the algorithms can be developed and improved. It is straightforward to extend the algorithms to base the segmentations on alternative or additional features, under not too restrictive conditions on the features.
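For illustration, the affine motion model itself can be fitted by least squares from sparse velocity samples as sketched below; the thesis instead estimates the parameters directly from orientation tensors, which is not reproduced here.

    import numpy as np

    def fit_affine_motion(xy, v):
        """Least-squares fit of an affine motion model, one set of coefficients
        per velocity component: v = [1, x, y] @ params.
        xy: (N, 2) image coordinates, v: (N, 2) measured velocities."""
        X = np.column_stack([np.ones(len(xy)), xy])      # design matrix [1, x, y]
        params, *_ = np.linalg.lstsq(X, v, rcond=None)   # shape (3, 2)
        return params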
@mastersthesis{diva2:302971,
author = {Farnebäck, Gunnar},
title = {{Motion-based segmentation of image sequences}},
school = {Linköping University},
type = {{LiTH-ISY-Ex No. 1596}},
year = {1996},
address = {Sweden},
}
Last updated: 2010-08-26