
Embedded Systems - ARM Programming and Optimization

by Jason D. Bakos

Elsevier Textbooks, 2015

ISBN: 9780128004128, 320 pages



Chapter 2

Multicore and data-level optimization


OpenMP and SIMD


Abstract


Embedded processors have much in common with desktop and server processors. Like desktop and server systems, mobile embedded systems are composed of multiple processors, but code must be explicitly written to utilize all available processors. Also like desktop and server processors, each processor contains a feature that allows a single instruction to process multiple elements of data, but using it also generally requires specific code features. Unlike desktop and server processors, embedded processors cannot automatically execute instructions in parallel unless the instructions appear in a favorable order in the software. Together, these aspects of program design can have a substantial performance impact on computationally expensive applications.

This chapter introduces how various program structures affect the degree to which a program can utilize critical system resources such as functional units and memory bandwidth. For each of these, the chapter describes how code optimizations incorporated into the program can recover lost performance. Understanding how to write and evaluate these types of optimizations is becoming increasingly important for embedded software. Traditional multimedia algorithms such as video decoding are based on well-refined standards and rarely change, but users have come to expect increasingly sophisticated image-processing algorithms, such as panoramic image stitching, augmented reality, facial recognition, and object classification. These algorithms are computationally demanding, and their practicality often depends on how efficiently they can be implemented.

Keywords

Multicore

OpenMP

SIMD

Data-level parallelism

Instruction-level parallelism

Instruction scheduling

ARM NEON

ARM VFP

Floating point

Chapter Outline

2.1 Optimization Techniques Covered by this Book

2.2 Amdahl's Law

2.3 Test Kernel: Polynomial Evaluation

2.4 Using Multiple Cores: OpenMP

2.4.1 OpenMP Directives

2.4.2 Scope

2.4.3 Other OpenMP Directives

2.4.4 OpenMP Synchronization

2.4.4.1 Critical Sections

2.4.4.2 Locks

2.4.4.3 Barriers

2.4.4.4 Atomic Sections

2.4.5 Debugging OpenMP Code

2.4.6 The OpenMP Parallel for Pragma

2.4.7 OpenMP with Performance Counters

2.4.8 OpenMP Support for the Horner Kernel

2.5 Performance Bounds

2.6 Performance Analysis

2.7 Inline Assembly Language in GCC

2.8 Optimization #1: Reducing Instructions per Flop

2.9 Optimization #2: Reducing CPI

2.9.1 Software Pipelining

2.9.2 Software Pipelining Horner's Method

2.10 Optimization #3: Multiple Flops per Instruction with Single Instruction, Multiple Data

2.10.1 ARM11 VFP Short Vector Instructions

2.10.2 ARM Cortex NEON Instructions

2.10.3 NEON Intrinsics

2.11 Chapter Wrap-Up

Exercises

Desktop and server processors contain many features designed to maximally exploit instruction-level parallelism and memory locality at runtime, often without regard to the cost of these features in terms of chip area or power consumption. Their design places the highest emphasis on superscalar out-of-order speculative execution, accurate branch prediction, and extremely sophisticated multilevel caches. This allows them to perform well even when executing code that was not written with performance in mind.

On the other hand, embedded processor design emphasizes energy efficiency over performance, so designers generally forgo these features in exchange for on-chip peripherals and specialized coprocessors for specific tasks. Because of this, embedded processor performance is more sensitive to code optimizations than that of desktop and server processors. A code optimization is any feature of the program code that is specifically designed to improve performance. Code optimizations can be added by the compiler (a tool that automatically transforms code) or by the programmer, and can be processor agnostic, such as eliminating redundant code, or processor specific, such as substituting a complex instruction for a sequence of simple instructions.

Conceptually, the process of optimizing code often begins with a naïve implementation: a serial but functionally correct implementation of a program. The programmer must then identify its kernels, the portions of code in which the most execution time is spent. After this, the programmer must identify the performance bottleneck of each kernel and transform the kernel code in a way that improves performance without changing its underlying computation. These changes generally involve removing code redundancy, exploiting parallelism, taking advantage of hardware features, or sacrificing numerical accuracy, precision, or dynamic range in favor of performance.

2.1 Optimization Techniques Covered by this Book


This chapter will cover two programmer-driven optimization techniques:

1. Using Assembly Language to Improve Instruction Level Parallelism and Reduce Compiler Overheads
In some situations, hand-written assembly code offers performance advantages over compiler-generated assembly code. You should not expect hand-written assembly to always outperform compiler-generated code; in fact, the automatic optimizers built into modern compilers are usually very effective for integer code, but hand-written assembly is often more effective for intensive floating-point kernels.

2. Multicore Parallelism
Even server processors cannot automatically distribute the workload of a program onto multiple concurrent processor cores. The programmer must add explicit multicore support to the program code and is also responsible for verifying that the code is free from concurrency errors such as data races and data sharing errors. Even then, achieving high multicore efficiency is difficult, but doing so is becoming increasingly important in embedded system programming.

The following chapters cover additional topics in program optimization, including:

1. Fixed-Point Arithmetic
Floating-point instructions are usually more costly than integer instructions, but are often unnecessary for multimedia and sensing applications. A fixed-point representation allows integers to represent fractional numbers at the cost of reduced dynamic range compared to floating point. Most high-level languages, including C/C++ and Java, lack native support for fixed point, so the programmer must include explicit support for fixed-point operations.

2. Loop Transformations
Cache performance is associated with a program's memory access locality, but in some cases the locality can be improved without changing the functionality of the program. This usually involves transforming the structure of loops, such as in loop tiling, where the programmer adds additional levels of loops to change the order in which program data is accessed.

3. Heterogeneous Computing
Many embedded systems, and even systems-on-a-chip, include integrated coprocessors such as Graphics Processing Units, Digital Signal Processors, or Field Programmable Gate Arrays that can perform specific types of computations faster than the general purpose...