Newcomers to deep learning are often told that pre-trained computer vision models can generally be treated as ‘black boxes’, without understanding the inner workings of the models. On this page, we compare the performance of some state-of-the-art computer vision models. In doing so, we build a better mental map of what performance looks like on the cutting edge, and demonstrate some of the visualization tools of data analysis.
We use state-of-the-art computer vision models from PyTorch Image Models (timm). Our benchmarks for these models were collected by Ross Wightman and come from his GitHub repository. Our analysis is based on Jeremy Howard’s original analysis.
Benchmarks
We download the benchmarks, in this case CSV files, from Ross Wightman’s GitHub repository.
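A minimal sketch of loading and preparing the benchmark data with pandas. The column names (`model`, `infer_samples_per_sec`, `infer_img_size`, `param_count`) are assumptions about the CSV layout, and the rows below are a small illustrative sample, not real measurements; in practice you would read the CSV straight from the repository.

```python
import io
import pandas as pd

# Illustrative sample in the assumed shape of the benchmark CSVs
# (column names are an assumption; check the actual file headers).
sample_csv = """model,infer_samples_per_sec,infer_img_size,param_count
levit_128,8000.0,224,9.2
resnet50,2500.0,224,25.6
swin_base_patch4_window7_224,600.0,224,87.8
"""

df = pd.read_csv(io.StringIO(sample_csv))
# In practice, read directly from GitHub with pd.read_csv(<raw CSV URL>).

# Seconds per image is the reciprocal of throughput.
df['secs'] = 1.0 / df['infer_samples_per_sec']

# Derive the architecture "family" from the leading letters of the model name.
df['family'] = df['model'].str.extract(r'^([a-z]+)', expand=False)
print(df[['model', 'family', 'secs']])
```

Deriving a `family` column up front lets every later chart color and group models consistently.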
We plot the benchmarks for the inference models. In our chart:

- the x axis shows how many seconds it takes to process one image, on a log scale
- the y axis shows the accuracy on ImageNet
- the size of each bubble is proportional to the size of the images used in testing
- the color shows which “family” the architecture is from.
The chart is interactive:

- hover your mouse over a marker to see details about the model
- double-click in the legend to display just one family
- single-click in the legend to show or hide a family.
We can easily restrict to a subset of the models to get a simpler plot. We distinguish between the convnext models trained on the 22,000-category ImageNet sample, convnext_in22, and those that weren’t, convnext.
subset = 'levit|resnetd|resnet|regnety|vgg|swin'
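A sketch of how such a regex subset selects rows, assuming a DataFrame with a `model` column; the model names here are illustrative.

```python
import pandas as pd

# Illustrative model names (not real benchmark rows).
df = pd.DataFrame({'model': [
    'levit_128', 'convnext_tiny', 'swin_base_patch4_window7_224', 'vgg16',
]})

subset = 'levit|resnetd|resnet|regnety|vgg|swin'

# Keep only the rows whose model name matches one of the listed families.
df_sub = df[df['model'].str.contains(subset, regex=True)]
print(df_sub['model'].tolist())
```

Because the pattern is an alternation of family prefixes, one `str.contains` call filters the whole table in a single pass.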
This function overlays a linear fit for each family.
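The function itself is not shown here; as a minimal sketch of the idea, one line per family can be fitted with `numpy.polyfit` against the log of the speed (the library choice and the column names are assumptions, and the data below is synthetic).

```python
import numpy as np
import pandas as pd

def fit_lines(df, x='secs', y='top1', group='family'):
    """Return (slope, intercept) of a linear fit of y vs log10(x) per group."""
    fits = {}
    for name, g in df.groupby(group):
        slope, intercept = np.polyfit(np.log10(g[x]), g[y], 1)
        fits[name] = (slope, intercept)
    return fits

# Synthetic data: two families, each lying exactly on a line in log-space.
df = pd.DataFrame({
    'family': ['a', 'a', 'a', 'b', 'b', 'b'],
    'secs':   [1e-4, 1e-3, 1e-2, 1e-4, 1e-3, 1e-2],
    'top1':   [70.0, 75.0, 80.0, 72.0, 74.0, 76.0],
})
fits = fit_lines(df)
```

The fitted lines can then be drawn on top of the scatter plot, one per family, to make each family’s speed/accuracy trade-off visible at a glance.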
The LeViT family of models is both fast and accurate. Apparently these models are constructed using a hybrid of convolutional neural networks and transformers.
The Swin family of transformers is apparently among the most accurate. It is described as a “hierarchical Transformer whose representation is computed with shifted windows.”
Speed vs parameter count
Finally, we compare speed against parameter count. Parameter count is often used in papers as a proxy for speed. However, as we see, speeds vary widely at each level of parameter count, so it is not a reliable proxy. Parameter count sometimes correlates with memory requirements, but that, too, is not always useful. In the following chart:

- the x axis shows the parameter count, on a log scale
- the y axis shows the speed in seconds, on a log scale.