Welcome | Publications | Research | Industry | Teaching | Biography
Mahdi Marsousi
Academic Affiliation:
Multimedia Processing Lab, 
Electrical and Computer Eng. Dept., 
University of Toronto,
Toronto, ON, Canada
Industrial Affiliation:
Research Engineer,
Research and Development Department,
Magna Electronics Inc.,
1 Kenview Blvd, Suite 200,
Brampton, ON, L6T 5E6, Canada
Contact Information:
Academic email:
Work email: 
Email: marsousi@gmail.com
Cell-phone: (647) 967-1585

Co-object detection and segmentation in computer vision:

Problem definition and motivation:

Automated object detection and segmentation in 2D and 3D images have received increasing attention in recent years, because both are needed in new computer-assisted solutions such as driver-assistance systems, self-driving vehicles, robotic vision, and medical diagnostic tools. Large high-tech companies like Google and Microsoft, as well as small start-up corporations, seek machine learning and computer vision experts to develop new algorithms for reliable object detection and segmentation. The most widely applied methods for object detection are kernel-based support vector machines (SVMs) and deep neural network classifiers. A wider variety of methods is used for object segmentation; for brevity, I categorize them into region-based methods (such as graph-cut and region-growing) and deformable models (such as level-set and smart-snake).

In general, computerized object detection and segmentation have been treated as two separate tasks: object detection is first applied to automatically detect and recognize objects of interest, and then, if necessary, object segmentation is applied to segment the detected objects. Yet in my reading of the image processing and computer vision literature, I do not recall anyone asking whether detection and segmentation are truly decoupled tasks. Let me explain with an example:

Object detection in driver-assistance systems is becoming very important for taking immediate action, either to avoid collisions or to reduce the severity of injuries in car accidents. Toward the latter goal, it is crucial to discriminate pedestrians from all other objects (such as vehicles) in images acquired from the environment surrounding the vehicle. This means, for example, that if it were impossible to avoid hitting both a car in front and a pedestrian on the side, the system should go straight ahead and hit the car in front! To handle the variety of situations on roads and to act immediately to reduce the severity of accidents, a learning-based machine vision approach is needed to classify objects in images of the surrounding environment.

Classifiers for object detection are first trained on datasets of images containing annotated objects from different classes. The trained classifiers are then used in real scenarios to assign objects in incoming images (live mode) to their classes. However, anyone who has worked on such a project knows this is not as easy as it sounds. In real scenarios, the background of the objects of interest differs from the training dataset. This changes the texture information around the objects, which usually reduces classification accuracy or even causes misclassification. Now, suppose we first segment the objects inside an image and then classify the segmented objects. This removes the background's effect from the classification task and can potentially improve detection performance. But you might ask how to segment objects before even detecting them. Some might say this requires an operator's intervention to supervise the detection and segmentation tasks; that answer is certainly off track. By saying this, I do not intend to dismiss the efforts on object detection with deep neural network classifiers. I simply want to draw your attention to my proposed algorithm for co-object detection-segmentation, with the hope that it sparks a new branch of research in computer vision.
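The background effect described above can be illustrated with a toy experiment. Here a nearest-centroid matcher (a deliberately simple stand-in for a real classifier) is trained on two 8×8 templates whose training backgrounds differ; a class-A object photographed against a new, bright background is then misclassified when the whole patch is compared, but correctly classified when only foreground pixels (as a segmentation step would supply) are compared. All templates and values are invented for illustration:

```python
import numpy as np

def nearest_centroid(x, centroids):
    """Index of the closest centroid in Euclidean distance."""
    return int(np.argmin([np.linalg.norm(x - c) for c in centroids]))

# Toy 8x8 templates: the object occupies rows/cols 2:6, the rest is background.
fg = np.zeros((8, 8), bool); fg[2:6, 2:6] = True

tmpl_a = np.where(fg, 1.0, 0.0)   # class A: bright object, dark training background
tmpl_b = np.where(fg, 0.3, 0.7)   # class B: dim object, bright training background
centroids = [tmpl_a.ravel(), tmpl_b.ravel()]

# A class-A object, but seen against a bright test background.
test = np.where(fg, 1.0, 0.65)

# Whole-patch matching is dominated by the background and picks class B...
pred_full = nearest_centroid(test.ravel(), centroids)
# ...while comparing only the segmented foreground pixels recovers class A.
pred_masked = nearest_centroid(test[fg], [tmpl_a[fg], tmpl_b[fg]])
print(pred_full, pred_masked)  # -> 1 0
```

The gap between the two predictions is exactly the coupling argued for here: a segmentation mask changes which pixels the classifier gets to see.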


In this research, a new approach is designed to perform co-object detection and segmentation in images. In this approach, the classification and segmentation processes interact with each other in a recursive scheme, iteratively improving both the detection and the segmentation results. The approach is also capable of detecting and segmenting multiple objects in an image. It is designed based on discriminative dictionary learning (DDL) in sparse representation. By incorporating shape information into DDL, we add both texture knowledge and shape prior information to the classification task. (Note: I am not providing technical details here until the manuscript of this idea is published. Source code and further technical details will then be made available for public use.)

The reason for selecting sparse representation is that the feature space of objects for classification has very high dimensionality, which makes the space highly sparse. It has been shown that the sparse representation classifier (SRC) performs better than other classification methods on sparse problems. It has also been shown that discriminative dictionary learning provides stronger discrimination between classes than analytic and conventional dictionary learning methods.
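The core mechanic of an SRC can be sketched in a few lines. A full SRC codes the test sample over the concatenation of all class dictionaries with an l1 penalty and picks the class with the smallest reconstruction residual; the sketch below substitutes per-class least-squares coding (a nearest-subspace simplification) to stay dependency-free. The dictionaries here are random placeholders, not learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)

def src_classify(y, dicts):
    """Assign y to the class whose dictionary reconstructs it with the
    smallest residual. (Simplification: per-class least squares in place
    of the l1-penalized coding used by a full SRC.)"""
    residuals = []
    for D in dicts:
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)
        residuals.append(np.linalg.norm(y - D @ coef))
    return int(np.argmin(residuals))

# Two placeholder class dictionaries: 5 atoms each in a 20-dim feature space.
dim, atoms = 20, 5
D0 = rng.normal(size=(dim, atoms))
D1 = rng.normal(size=(dim, atoms))

# A test sample generated from class 1's atoms is attributed to class 1.
y = D1 @ rng.normal(size=atoms)
pred = src_classify(y, [D0, D1])
print(pred)  # -> 1
```

The residual comparison is what makes the classifier "discriminative": a sample is claimed by whichever class subspace explains it best.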

In the training step of the classical DDL method, patches are extracted from images, and both training and classification are performed in the patch domain. The training procedure of DDL is shown in the figure below.
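The patch-extraction step that feeds DDL training can be sketched as a sliding window that flattens each patch into a column, the usual layout for dictionary learning (patch size and stride below are arbitrary illustrative choices):

```python
import numpy as np

def extract_patches(image, size, stride):
    """Collect overlapping size x size patches from a 2D image and
    stack them as columns of a (size*size, n_patches) matrix."""
    h, w = image.shape
    cols = []
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            cols.append(image[i:i + size, j:j + size].ravel())
    return np.stack(cols, axis=1)

img = np.arange(36, dtype=float).reshape(6, 6)
P = extract_patches(img, size=4, stride=2)
print(P.shape)  # -> (16, 4)
```

Each column of `P` is one training sample; a dictionary learner then fits its atoms to these columns.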

Training process of discriminative dictionary learning.

In the proposed approach, shape prior information is incorporated into the discriminative dictionary learning process, called shape-included DDL. The training process of shape-included DDL is shown in the figure below.
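Since the technical details are withheld until publication, the following is only a generic illustration of one common way to attach shape prior information to a patch feature: concatenating the texture patch with the corresponding patch of a shape map (binary or signed-distance), weighted by a blending factor. The function name, the concatenation scheme, and `alpha` are all assumptions for illustration, not the author's formulation:

```python
import numpy as np

def shape_augmented_patch(intensity_patch, shape_patch, alpha=0.5):
    """Illustrative feature: texture pixels concatenated with the
    co-located shape-map pixels, weighted by alpha. (Assumed scheme,
    not the withheld shape-included DDL formulation.)"""
    return np.concatenate([intensity_patch.ravel(),
                           alpha * shape_patch.ravel()])

tex = np.random.default_rng(0).random((8, 8))   # texture patch
shp = np.zeros((8, 8)); shp[2:6, 2:6] = 1.0     # binary shape-prior patch
feat = shape_augmented_patch(tex, shp)
print(feat.shape)  # -> (128,)
```

A dictionary learned over such augmented features would then encode texture and shape jointly, which is the general idea the text describes.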

Training process of shape-included discriminative dictionary learning.

In the proposed approach, patches are first extracted from input images, and classification is performed in the patch domain. The classification result is then transformed into the image domain, where a segmentation process is performed. The output of the segmentation process is transformed back into the patch domain, and the classification process is repeated. This process iterates until all objects in the image domain are detected, recognized, and segmented. The interaction between the patch and image domains is shown in the figure below.
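The patch/image interaction loop described above can be sketched as follows. The patch classifier is passed in as a callable, and the image-domain "segmentation" is a simple vote-and-threshold stand-in for the author's (unpublished) segmentation model; patch size, stride, iteration count, and the toy classifier are all illustrative assumptions:

```python
import numpy as np

def codetect_segment(image, classify_patch, n_iters=3, size=8, stride=4):
    """Iterate between the patch domain (classification) and the image
    domain (segmentation). classify_patch(patch, mask_patch) -> 0/1 label;
    the segmentation step is a majority vote + threshold placeholder."""
    h, w = image.shape
    mask = np.ones((h, w))               # start with everything "foreground"
    for _ in range(n_iters):
        votes = np.zeros((h, w)); hits = np.zeros((h, w))
        # Patch domain: classify each patch given the current mask.
        for i in range(0, h - size + 1, stride):
            for j in range(0, w - size + 1, stride):
                p = image[i:i + size, j:j + size]
                m = mask[i:i + size, j:j + size]
                votes[i:i + size, j:j + size] += classify_patch(p, m)
                hits[i:i + size, j:j + size] += 1
        # Image domain: aggregate patch labels, then "segment" by threshold.
        score = np.divide(votes, hits, out=np.zeros_like(votes), where=hits > 0)
        mask = (score > 0.5).astype(float)
    return mask

# Toy run: the "object" is a bright square; the stand-in classifier
# fires when the masked patch is bright enough on average.
img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0
seg = codetect_segment(img, lambda p, m: int((p * m).mean() > 0.25))
print(seg.sum() > 0)  # -> True
```

The point of the loop is that each segmentation pass changes the mask the classifier sees, and each classification pass changes the label map the segmenter smooths, which is the recursion the text describes.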

Interaction between the patch and image domains.

Additional resources:

Download presentation slides at ICIP-2016 here.

© 2016. All rights reserved.