A more parameter-efficient SOTA bottleneck! (2020/07)

A more parameter-efficient bottleneck for EfficientNet!

Linear Bottleneck with Efficient Channel Attention instead of Squeeze Excitation

Introduction

CNNs are great, blablabla… Let’s get to the point. The SOTA for image classification on ImageNet is EfficientNet, with 88.5% top-1 accuracy in 2020. EfficientNet builds on MobileNetV2. In this article, I introduce a combination of EfficientNet and Efficient Channel Attention (ECA) to highlight the results of the ECA paper from Tianjin/Dalian/Harbin universities.

Linear Bottleneck

EfficientNet is based on MobileNetV2. MobileNetV2 is composed of multiple blocks called linear bottlenecks or inverted residuals (they’re almost the same). More information can be found in the MobileNetV2 paper.

Linear Bottleneck (cf MobileNetV2 paper)

A linear bottleneck is a residual layer composed of a 1x1 convolution, followed by a 3x3 depthwise convolution, then a final 1x1 convolution. It is used as the basis for EfficientNet because of its efficiency (yeah). Neither 1x1 convolutions nor 3x3 depthwise convolutions require many parameters, and both are fast to compute.
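To make the parameter argument concrete, here is a small counting sketch. The sizes (24 channels, expansion factor 6) are arbitrary illustration values, not taken from a specific EfficientNet block, and biases are ignored on the assumption that a batch norm follows each convolution.

```python
def conv_params(c_in, c_out, kernel_size, groups=1):
    """Number of weights in a 2D convolution without bias."""
    return (c_in // groups) * c_out * kernel_size * kernel_size

# Hypothetical block: 24 channels in and out, expansion factor 6
c, expand = 24, 6
hidden = c * expand  # 144 expanded channels

expand_1x1 = conv_params(c, hidden, 1)                         # 24 * 144
depthwise_3x3 = conv_params(hidden, hidden, 3, groups=hidden)  # 144 * 9
project_1x1 = conv_params(hidden, c, 1)                        # 144 * 24
bottleneck_total = expand_1x1 + depthwise_3x3 + project_1x1

# For comparison: one dense 3x3 convolution on the expanded channels
full_3x3 = conv_params(hidden, hidden, 3)

print(bottleneck_total, full_3x3)  # 8208 186624
```

With these made-up sizes, the whole bottleneck costs far fewer weights than a single dense 3x3 convolution at the expanded width.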

However, linear bottlenecks don’t have an attention layer.

EfficientNet: Linear Bottleneck + Squeeze Excitation

EfficientNet solves this problem by adding an attention layer. In 2018~2019, the trending attention layer for convolutions was Squeeze-and-Excitation (SE). The same layer is used in MobileNetV3.

EfficientNet solving the “attention problem”: linearbottleneck_se. A better representation with hard-sigmoid can be found in MobileNetV3 paper (page 3)

So it’s almost the same: 1x1 conv, 3x3 depthwise conv, SE, 1x1 conv. This is what powers the SOTA.

SE: given a feature map with C channels, SE produces C scalars, one to be multiplied with each channel. You have a feature map of shape B*C*H*W; you apply global average pooling and get B*C*1*1; then comes a 1x1 convolution (the same as a simple linear layer: multiply by weights, sum, add biases), an activation layer, another 1x1 convolution and a sigmoid. You end up with B*C*1*1 scalars between 0 and 1 to be multiplied with your B*C*H*W feature map, which effectively produces an attention over the channels of your feature map.
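The steps above can be sketched in plain Python for a single sample (no batch dimension). This is a minimal illustration, not EfficientNet's actual code: the hidden width R of the first 1x1 convolution, the choice of ReLU as the "activation layer", and all argument names are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_forward(feature_map, w_reduce, b_reduce, w_expand, b_expand):
    """Apply an SE-style gate to one sample.

    feature_map: C channels, each a list of H rows of W floats.
    w_reduce: R rows of C weights (first 1x1 conv), b_reduce: R biases.
    w_expand: C rows of R weights (second 1x1 conv), b_expand: C biases.
    """
    # Squeeze: global average pooling, one scalar per channel
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: 1x1 conv -> activation (ReLU assumed) -> 1x1 conv -> sigmoid
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)) + b)
              for row, b in zip(w_reduce, b_reduce)]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
             for row, b in zip(w_expand, b_expand)]
    # Scale: multiply every value of channel i by gate i
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates

# Toy input: C=2, H=1, W=2, hidden width R=1, all-zero weights
fm = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled, gates = se_forward(fm, [[0.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0])
print(gates)  # [0.5, 0.5]
```

With all-zero weights every gate is sigmoid(0) = 0.5, so each channel is simply halved; trained weights would give each channel its own gate.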

The problem is that SE has a lot of parameters. It undoubtedly improves accuracy, but multiplying everything with everything isn’t necessarily “efficient”. EfficientNetB0 has 5.3M parameters and 0.39B FLOPs.
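To get a feel for how fast the SE cost grows, here is a rough parameter count for the two 1x1 convolutions. The reduction ratio of 16 and the channel widths are illustration values only; they are not EfficientNet's configuration.

```python
def se_params(channels, reduced):
    """Weights plus biases of the two 1x1 convolutions in an SE layer."""
    first = channels * reduced + reduced    # C -> R
    second = reduced * channels + channels  # R -> C
    return first + second

# Hypothetical widths, each with a reduction ratio of 16
counts = {c: se_params(c, c // 16) for c in (64, 256, 1024)}
print(counts)  # {64: 580, 256: 8464, 1024: 132160}
```

The count grows roughly with the square of the channel width, which is why SE adds up quickly in the wide layers of a network.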

We reproduced the EfficientNetB0 architecture with some tweaks to the hyperparameters (optimizer, augmentation...) for our usage (fast, non-dataset-specific learning). EfficientNet results are hard to reproduce on low/mid-range hardware (others have reported this too). Working re-implementations of the training in PyTorch include rwightman’s and kakaobrain’s fast-autoaugment. This is our result:

Our reproduction of EfficientNetB0 after ~50 epochs (source for red line)

As you can see, we didn’t manage to reproduce the reported accuracy with our setup, but it doesn’t really matter for this experiment.

EfficientNetECA: Linear Bottleneck + Efficient Channel Attention

ECA is an attention layer from 2019~2020 (paper: https://arxiv.org/pdf/1910.03151.pdf). It’s an extremely simple attention layer which performs pretty well, as you can see:

ECA layer : https://arxiv.org/pdf/1910.03151.pdf

ECA models are more parameter-efficient than SE models: SENet-50 (SE-ResNet-50) adds about 10% more parameters to ResNet-50, while ECANet-50 adds less than 1% in parameters and GFLOPs (though ECANet does reduce the FPS by 3%, versus 13% for SENet). The SE paper had already reported this overhead.

It’s basic: ECA captures what the authors call “local cross-channel interaction”, which can be translated to “Conv1D with a small kernel size”.
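How small is "small"? The ECA paper also describes choosing the kernel size from the number of channels. The sketch below follows that rule as we understand it (gamma=2, b=1, rounded to an odd integer); treat the constants and the function name as assumptions and check them against the paper before relying on them.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Odd 1D kernel size derived from the channel count (assumed rule)."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1  # force an odd size

sizes = {c: eca_kernel_size(c) for c in (64, 128, 256, 512)}
print(sizes)  # {64: 3, 128: 5, 256: 5, 512: 5}
```

Since the 1D convolution has no bias, the kernel size is also the total number of parameters of the layer.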

ECA: given a feature map with C channels, ECA produces C scalars, one to be multiplied with each channel. You have a feature map of shape B*C*H*W; you apply global average pooling and get B*C*1*1; then comes a 1D convolution across the channel dimension with a small kernel_size (3 for example) and no bias, then a sigmoid, and voilà. You end up with B*C*1*1 scalars between 0 and 1 to be multiplied with your B*C*H*W feature map, which effectively produces an attention over the channels of your feature map.
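Here is the same idea as a plain-Python sketch for a single sample (no batch dimension). It is only an illustration under a few assumptions: zero padding at both ends of the channel axis, an odd kernel size, and hypothetical function and argument names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def eca_forward(feature_map, kernel):
    """Apply an ECA-style gate to one sample.

    feature_map: C channels, each a list of H rows of W floats.
    kernel: the k weights of the 1D convolution (k odd, no bias).
    """
    c, k = len(feature_map), len(kernel)
    pad = k // 2
    # Global average pooling: one scalar per channel
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 1D convolution along the channel axis (zero padding assumed)
    padded = [0.0] * pad + pooled + [0.0] * pad
    gates = [sigmoid(sum(kernel[j] * padded[i + j] for j in range(k)))
             for i in range(c)]
    # Multiply every value of channel i by gate i
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates

# Toy input: C=3, H=1, W=2. This kernel only looks at the channel itself.
fm = [[[2.0, 4.0]], [[6.0, 8.0]], [[1.0, 1.0]]]
scaled, gates = eca_forward(fm, [0.0, 1.0, 0.0])
```

With the kernel [0, 1, 0], each gate is just the sigmoid of its own channel's average; non-zero outer weights are what let neighbouring channels influence each other. The whole layer has k parameters, whatever the number of channels.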

And that’s it. Now introducing: Linear Bottleneck ECA. What is it? It’s Linear Bottleneck SE (EfficientNet’s main block), with an ECA layer in place of the SE layer. How does it perform?

Validation Accuracy we got with EfficientNetECA

Ok, it’s the same, what’s the point? Well B0 has 5,288,548 parameters. ECA-B0 has 4,652,084 parameters. We reduced the number of parameters of the SOTA architecture by 12%!
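The 12% figure follows directly from the two parameter counts quoted above; a quick check (variable names are ours):

```python
b0_params = 5_288_548      # EfficientNetB0 with SE (from the text)
eca_b0_params = 4_652_084  # EfficientNetB0 with ECA (from the text)

saved = b0_params - eca_b0_params        # parameters removed
reduction_pct = 100 * saved / b0_params  # relative reduction in percent

print(saved, round(reduction_pct, 1))  # 636464 12.0
```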

After 66,560,000 images seen (52 epochs), for a single training run on our setup, B0 without any attention layer is at 72.57%, B0-SE is at 73.17% and B0-ECA is at 73.31%. This is evaluated on the whole ImageNet validation set. The difference is not significant, but it supports the hypothesis that ECA performs as well as or better than SE, with far fewer parameters.
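For readers who want the gaps spelled out, the numbers above work out as follows. The images-per-epoch value is simply inferred from the two figures in the text, and the dictionary keys are labels made up for this snippet.

```python
images_seen = 66_560_000
epochs = 52
images_per_epoch = images_seen // epochs

top1 = {"B0 (no attention)": 72.57, "B0-SE": 73.17, "B0-ECA": 73.31}
gain_se = round(top1["B0-SE"] - top1["B0 (no attention)"], 2)
gain_eca_over_se = round(top1["B0-ECA"] - top1["B0-SE"], 2)

print(images_per_epoch, gain_se, gain_eca_over_se)  # 1280000 0.6 0.14
```

So on this single run, adding SE is worth about 0.6 points over no attention, and swapping SE for ECA moves the result by a further 0.14 points, which is small enough to be noise.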

Tl;dr

You don’t use channel attention? Pffft, cringe! Use it. You use SE? Try ECA: fewer parameters, more efficiency. We trained EfficientNetB0 with an ECA layer instead of SE and it performed similarly with 12% fewer parameters.

Conclusion / Further works

The ECA layer seems to be a great way to improve models which don’t use an attention layer, and to make models which use SE more parameter-efficient.

We don’t have the hardware to train EfficientNet up to its reported accuracy, but we hope someone will try training a NoisyStudent-FixEfficientNetECA-L2. We’ll do it ourselves if we manage to find the GPU-days for it, or maybe the ECA team will try it (they have already published results and their GitHub code for ECA_MobileNetV2). Stay tuned!

_____________________________________

More plots

Partial Top1Acc = f(n_images_seen) — with curve smoothing

We can see that SE seems to perform better at the beginning of training (in terms of iteration), even though a complete evaluation on the whole set after 52 epochs puts ECA ahead.

___

Partial Top1Acc = f(time) — with curve smoothing

In terms of training speed there’s no clear winner, although it looks like SE performs better at the end of training.

Addendums

Addendum 1 (2020/07/22): YoloV4 (SOTA or almost in early 2020) uses CBAM/SAM, which according to the ECA paper is worse than ECA. It looks like using ECA instead of SAM could improve the current SOTA in both image classification and object detection. The ECA paper showed that ECA improves RetinaNet and Faster R-CNN.

Addendum 2 (2020/07/24): Of course, accuracies would need to be evaluated more properly, ideally by averaging them over multiple training runs on a fully working reproduction of the SOTA. This article is just a quick ablation study; it’s not as rigorous as a research paper.

Addendum 3 (2020/07/25): It’s worth noting that attention layers won’t improve your network massively. SE improved ResNet-50 accuracy by 1.51% (absolute) and ECA improved it by 2.28%. It’s an easy improvement, but on a custom dataset with a custom network you’ll probably gain more by tuning other hyperparameters.

Addendum 4 (2020/07/28): +plots, +code

Addendum 5 (2020/10/12): Other results could indicate that ECA, while being more parameter-efficient and still better than no attention, could have a lower score than SE on some trainings.

Code

____________________________________________

This article was written on Medium, by Elie D, for https://www.hyugen.com/en. We are working on AGI.

Date: 2020/07/21
