
Embedded Systems - ARM Programming and Optimization

by Jason D. Bakos

Elsevier Textbooks, 2015

ISBN: 9780128004128, 320 pages



Chapter 2

Multicore and data-level optimization


OpenMP and SIMD


Abstract


Embedded processors have much in common with desktop and server processors. Like desktop and server systems, mobile embedded systems are composed of multiple processors, but code must be explicitly written to utilize all available processors. Also like desktop and server processors, each processor contains a feature that allows a single instruction to process multiple elements of data, but using it also generally requires specific code features. Unlike desktop and server processors, embedded processors cannot automatically execute instructions in parallel unless the instructions appear in a favorable order in the software. Together, these aspects of program design can have a substantial performance impact on computationally expensive applications.

This chapter introduces how various program structures affect the degree to which a program can utilize critical system resources such as functional units and memory bandwidth. For each of these, the chapter describes how code optimizations incorporated into the program can recover lost performance. Understanding how to write and evaluate these types of optimizations is becoming increasingly important for embedded software. Traditional multimedia algorithms such as video decoding are based on well-refined standards and rarely change, but users have come to expect increasingly sophisticated image-processing algorithms, such as panoramic image stitching, augmented reality, facial recognition, and object classification. These algorithms are computationally demanding, and their practicality often depends on how efficiently they can be implemented.

Keywords

Multicore

OpenMP

SIMD

Data-level parallelism

Instruction-level parallelism

Instruction scheduling

ARM NEON

ARM VFP

Floating point

Chapter Outline

2.1 Optimization Techniques Covered by this Book

2.2 Amdahl's Law

2.3 Test Kernel: Polynomial Evaluation

2.4 Using Multiple Cores: OpenMP

2.4.1 OpenMP Directives

2.4.2 Scope

2.4.3 Other OpenMP Directives

2.4.4 OpenMP Synchronization

2.4.4.1 Critical Sections

2.4.4.2 Locks

2.4.4.3 Barriers

2.4.4.4 Atomic Sections

2.4.5 Debugging OpenMP Code

2.4.6 The OpenMP Parallel for Pragma

2.4.7 OpenMP with Performance Counters

2.4.8 OpenMP Support for the Horner Kernel

2.5 Performance Bounds

2.6 Performance Analysis

2.7 Inline Assembly Language in GCC

2.8 Optimization #1: Reducing Instructions per Flop

2.9 Optimization #2: Reducing CPI

2.9.1 Software Pipelining

2.9.2 Software Pipelining Horner's Method

2.10 Optimization #3: Multiple Flops per Instruction with Single Instruction, Multiple Data

2.10.1 ARM11 VFP Short Vector Instructions

2.10.2 ARM Cortex NEON Instructions

2.10.3 NEON Intrinsics

2.11 Chapter Wrap-Up

Exercises

Desktop and server processors contain many features designed to maximally exploit instruction-level parallelism and memory locality at runtime, often without regard to the cost of these features in terms of chip area or power consumption. Their design places the highest emphasis on superscalar out-of-order speculative execution, accurate branch prediction, and extremely sophisticated multilevel caches. This allows them to perform well even when executing code that was not written with performance in mind.

On the other hand, embedded processor design emphasizes energy efficiency over performance, so designers generally forgo these features in exchange for on-chip peripherals and specialized coprocessors for specific tasks. Because of this, embedded processor performance is more sensitive to code optimizations than that of desktop and server processors. A code optimization is any feature of the program code that is specifically designed to improve performance. Code optimizations can be added by the compiler (a tool that automatically transforms code) or by the programmer, and can be processor agnostic, such as eliminating redundant code, or processor specific, such as substituting a complex instruction for a sequence of simple instructions.

Conceptually, the process of optimizing code often begins with a naïve implementation: a serial but functionally correct implementation of a program. The programmer must then identify its kernels, the portions of code in which the most execution time is spent. After this, the programmer must identify the performance bottleneck of each kernel and transform the kernel code in a way that improves performance without changing its underlying computation. These changes generally involve removing code redundancy, exploiting parallelism, taking advantage of hardware features, or sacrificing numerical accuracy, precision, or dynamic range in favor of performance.

2.1 Optimization Techniques Covered by this Book


This chapter will cover two programmer-driven optimization techniques:

1. Using Assembly Language to Improve Instruction Level Parallelism and Reduce Compiler Overheads
In some situations, hand-written assembly code offers performance advantages over compiler-generated assembly code. You should not expect hand-written assembly to always outperform compiler-generated code; in fact, the automatic optimizers built into modern compilers are usually very effective for integer code, but hand-written assembly is often more effective for intensive floating-point kernels.

2. Multicore Parallelism
Even server processors cannot automatically distribute the workload of a program onto multiple concurrent processor cores. The programmer must add explicit multicore support to the program code and is also responsible for verifying that the code is free from concurrency errors such as data races and data sharing errors. Even then, achieving high multicore efficiency is difficult, but doing so is becoming increasingly important in embedded system programming.

The following chapters cover additional topics in program optimization, including:

1. Fixed-Point Arithmetic
Floating-point instructions are usually more costly than integer instructions, but are often unnecessary for multimedia and sensing applications. A fixed-point representation allows integers to represent fractional numbers at the cost of reduced dynamic range compared to floating point. Most high-level languages, including C/C++ and Java, lack native support for fixed point, so the programmer must include explicit support for fixed-point operations.

2. Loop Transformations
Cache performance is associated with a program's memory access locality, but in some cases the locality can be improved without changing the functionality of the program. This usually involves transforming the structure of loops, such as in loop tiling, where the programmer adds additional levels of loops to change the order in which program data is accessed.

3. Heterogeneous Computing
Many embedded systems, and even systems-on-a-chip, include integrated coprocessors such as Graphics Processing Units, Digital Signal Processors, or Field Programmable Gate Arrays that can perform specific types of computations faster than the general purpose...