We propose a new method for learning a representation of image motion in an unsupervised fashion. We do so by learning an image sequence embedding that respects associativity and invertibility properties of composed sequences with known temporal order. Our network architecture consists of a CNN followed by an LSTM and is structured to learn motion representations over sequences of arbitrary length. We demonstrate that a network trained using our unsupervised procedure on real-world sequences of human actions and vehicle motion can capture semantic regions corresponding to the motion in the scene, and not merely image-level differences, without requiring any motion labels. We present results that suggest our method can be used to extract information useful for independent motion tracking, localization, nearest neighbor identification, and other motion tasks.
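The additive composition implied by these properties can be illustrated with a toy sketch (an illustrative modeling assumption, not the paper's implementation): if sequence embeddings compose by addition, associativity holds exactly, and invertibility can be encouraged by penalizing the composition of a sequence with its temporal reverse.

```python
import numpy as np

def compose(*embs):
    # Additive composition of motion embeddings (a modeling assumption):
    # under addition, associativity of composition holds exactly.
    return np.sum(embs, axis=0)

def invertibility_loss(e_fwd, e_rev):
    # A forward motion followed by its reverse should compose to the
    # identity motion, represented here as the zero vector.
    return float(np.sum((compose(e_fwd, e_rev)) ** 2))

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 8))  # stand-ins for learned embeddings

# associativity: (a ∘ b) ∘ c == a ∘ (b ∘ c)
assert np.allclose(compose(compose(a, b), c), compose(a, compose(b, c)))
# a perfectly invertible pair incurs zero loss
print(invertibility_loss(a, -a))
```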
We propose robust methods for estimating camera egomotion in noisy, real-world monocular image sequences. This is a difficult problem because of the nonconvex cost function of the perspective camera motion equation and because of non-Gaussian noise arising from noisy optical flow estimates and scene non-rigidity. We introduce the expected residual likelihood method (ERL), which estimates confidence weights for noisy optical flow data from the likelihood distributions of the flow-field residuals under a range of counterfactual model parameters. ERL is effective at identifying outliers and recovering appropriate confidence weights in many settings. On the KITTI dataset, it outperforms baseline strategies based on RANSAC, as well as a novel robust lifted-kernel method, while adding almost no runtime cost over non-robust methods.
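The core idea behind ERL-style weighting can be sketched on a toy robust regression problem (this is a simplified illustration, not the paper's egomotion implementation): score each datum by the average likelihood of its residual under models fit to random data subsets, then solve a weighted least-squares fit with those scores as confidence weights.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.05 * rng.standard_normal(100)  # inliers: slope 2.0
y[:10] += 3.0                                   # gross outliers

weights = np.zeros_like(x)
for _ in range(50):
    # fit "counterfactual" model parameters on a small random subset
    idx = rng.choice(len(x), 5, replace=False)
    s = np.linalg.lstsq(x[idx, None], y[idx], rcond=None)[0][0]
    r = y - s * x
    # robust scale estimate of the residual distribution (MAD)
    sigma = 1.4826 * np.median(np.abs(r - np.median(r)))
    # accumulate each datum's residual likelihood under this model
    weights += np.exp(-0.5 * (r / max(sigma, 1e-6)) ** 2)
weights /= weights.max()

# confidence-weighted least squares recovers a slope near 2.0,
# since outliers receive consistently low residual likelihoods
slope = np.sum(weights * x * y) / np.sum(weights * x * x)
print(slope)
```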
An essential task of the auditory system is to discriminate between different communication signals. To do this, the auditory system is thought to build a representation that is invariant to acoustic transformations. To investigate this hypothesis, we recorded the responses of populations of neurons in primary and non-primary auditory cortex to raw and acoustically distorted vocalizations. We found that populations of neurons in the non-primary auditory cortex exhibited greater invariance in encoding vocalizations over acoustic transformations than neuronal populations in the primary auditory cortex. These findings support the hypothesis that invariant representations are created gradually through hierarchical transformation within the auditory pathway.
We review models of V1 simple and complex cell receptive fields in light of recent evidence from intracellular cortical recordings. Intracellular recordings can capture subthreshold neuronal responses, and these data reveal a more diverse set of functional subunits than predicted by the classic Hubel-Wiesel receptive field model or seen in spiking data. This evidence suggests that simple and complex properties arise from different balances of thalamic and intracortical input, rather than being determined primarily by synaptic input from a few dominant sources.
We examine the causal relationship between the phase of alpha oscillations and conscious visual perception using rhythmic TMS (rTMS) while simultaneously recording EEG activity. rTMS of posterior parietal cortex at an alpha frequency entrained the phase of subsequent alpha oscillatory activity and produced a phase-dependent change in subsequent visual perception and discrimination accuracy. By extrinsically manipulating the phase of alpha before stimulus presentation, we provide direct evidence that the neural circuitry in the parietal cortex involved in generating alpha oscillations plays a causal role in determining whether or not a visual stimulus is successfully perceived.
We have developed methods for estimating 3D human body shape directly from images using deep convolutional neural networks and an expressive, statistical model of human body shape. Our formulation allows us to integrate a CNN with the machinery of a body model that captures accurate body shape in a variety of poses. This body model can be integrated into existing graphics pipelines, and our methods thus allow direct vision-to-graphics conversion for a wide range of human body shapes and poses.
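The body-model side of such a pipeline can be sketched as a linear shape space in the style of SMPL (the dimensions and names below are illustrative assumptions): a CNN regresses low-dimensional shape coefficients, and the statistical model decodes them into a mesh.

```python
import numpy as np

rng = np.random.default_rng(0)
n_verts, n_betas = 6890, 10                          # SMPL-like sizes
template = rng.standard_normal((n_verts, 3))         # mean mesh (stand-in)
shape_dirs = rng.standard_normal((n_verts, 3, n_betas))  # shape basis

def decode_shape(beta):
    # mesh = template + linear combination of shape blendshape directions
    return template + shape_dirs @ beta

beta = rng.standard_normal(n_betas) * 0.1            # e.g. a CNN's output
mesh = decode_shape(beta)
print(mesh.shape)  # (6890, 3)
```

Because the decoder is differentiable in beta, the CNN and the body model can be trained or fit jointly, which is what allows the direct vision-to-graphics conversion described above.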
A common strategy for developing neural network models for new tasks is to take a network pre-trained on a well-studied task, such as ImageNet classification, and fine-tune it on the new task. However, this strategy requires the use of an architecture that was trained on ImageNet, which may include design decisions that are ill-suited for the target task. To overcome this limitation, we have developed methods to transfer feature representations from an ensemble of learned models to a new model using a learned basis initialization. This method gives a flexible way to build models with new architectures while still taking advantage of the features learned on large, well-characterized datasets.
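One way such a transfer could work is sketched below (the procedure and names are assumptions for illustration, not the paper's exact method): pool features from an ensemble of source models on shared probe inputs, extract a common low-rank basis via SVD, and initialize a new layer to map inputs onto that basis by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))                   # shared probe inputs
# features from 3 hypothetical source models on the same inputs
ensemble_feats = [np.tanh(X @ rng.standard_normal((32, 64)))
                  for _ in range(3)]
F = np.concatenate(ensemble_feats, axis=1)           # pooled (200, 192)

# shared basis: top-k singular directions of the pooled, centered features
k = 16
U, S, Vt = np.linalg.svd(F - F.mean(0), full_matrices=False)
targets = U[:, :k] * S[:k]                           # basis-aligned targets

# initialize a new linear layer so that X @ W approximates the basis
W, *_ = np.linalg.lstsq(X, targets, rcond=None)
print(W.shape)  # (32, 16)
```

The new architecture is free to differ from the source models; only the initialization is tied to the ensemble's learned features, after which training on the target task proceeds normally.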