
Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture

Igor Zhirkov

Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture. Igor Zhirkov, Saint Petersburg, Russia. ISBN-13 (pbk): 978-1-4842-2402-1. DOI 10.1007/978-1-4842-2403-8

ISBN-13 (electronic): 978-1-4842-2403-8

Library of Congress Control Number: 2017945327

Copyright © 2017 by Igor Zhirkov

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Cover image designed by Freepik

Managing Director: Welmoed Spahr
Editorial Director: Todd Green
Acquisitions Editor: Robert Hutchinson
Development Editor: Laura Berendson
Technical Reviewer: Ivan Loginov
Coordinating Editor: Rita Fernando
Copy Editor: Lori Jacobs
Compositor: SPi Global
Indexer: SPi Global

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com.
Apress Media, LLC is a California LLC, and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail [email protected], or visit http://www.apress.com/rights-permissions.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484224021. For more detailed information, please visit http://www.apress.com/source-code.

Printed on acid-free paper

Contents at a Glance

About the Author ..... xix
About the Technical Reviewer ..... xxi
Acknowledgments ..... xxiii
Introduction ..... xxv

■ Part I: Assembly Language and Computer Architecture ..... 1
■ Chapter 1: Basic Computer Architecture ..... 3
■ Chapter 2: Assembly Language ..... 17
■ Chapter 3: Legacy ..... 39
■ Chapter 4: Virtual Memory ..... 47
■ Chapter 5: Compilation Pipeline ..... 63
■ Chapter 6: Interrupts and System Calls ..... 91
■ Chapter 7: Models of Computation ..... 101

■ Part II: The C Programming Language ..... 127
■ Chapter 8: Basics ..... 129
■ Chapter 9: Type System ..... 147
■ Chapter 10: Code Structure ..... 181
■ Chapter 11: Memory ..... 201
■ Chapter 12: Syntax, Semantics, and Pragmatics ..... 221
■ Chapter 13: Good Code Practices ..... 241



■ Part III: Between C and Assembly ..... 263
■ Chapter 14: Translation Details ..... 265
■ Chapter 15: Shared Objects and Code Models ..... 291
■ Chapter 16: Performance ..... 327
■ Chapter 17: Multithreading ..... 357

■ Part IV: Appendices ..... 397
■ Chapter 18: Appendix A. Using gdb ..... 399
■ Chapter 19: Appendix B. Using Make ..... 409
■ Chapter 20: Appendix C. System Calls ..... 415
■ Chapter 21: Appendix D. Performance Tests Information ..... 421
■ Chapter 22: Bibliography ..... 425
Index ..... 429


Contents

About the Author ..... xix
About the Technical Reviewer ..... xxi
Acknowledgments ..... xxiii
Introduction ..... xxv

■ Part I: Assembly Language and Computer Architecture ..... 1
■ Chapter 1: Basic Computer Architecture ..... 3
1.1 Core Architecture ..... 3
1.1.1 Model of Computation ..... 3
1.1.2 von Neumann Architecture ..... 3

1.2 Evolution ..... 5
1.2.1 Drawbacks of von Neumann Architecture ..... 5
1.2.2 Intel 64 Architecture ..... 6
1.2.3 Architecture Extensions ..... 6

1.3 Registers ..... 7
1.3.1 General Purpose Registers ..... 8
1.3.2 Other Registers ..... 11
1.3.3 System Registers ..... 12

1.4 Protection Rings ..... 14
1.5 Hardware Stack ..... 14
1.6 Summary



■ Chapter 2: Assembly Language ..... 17
2.1 Setting Up the Environment ..... 17
2.1.1 Working with Code Examples ..... 18

2.2 Writing "Hello, world" ..... 18
2.2.2 Program Structure ..... 20

2.3 Example: Printing Register Contents ..... 23
2.3.2 Relative Addressing ..... 23
2.3.3 Order of Execution ..... 24

2.4 Function Calls ..... 25
2.5 Working with Data ..... 28
2.5.1 Endianness ..... 28
2.5.2 Strings ..... 29
2.5.3 Constant Precomputation ..... 30
2.5.4 Pointers and Different Addressing Types ..... 30

2.6 Example: Calculating String Length ..... 32
2.7 Assignment: Input/Output Library ..... 34
2.7.1 Self-Evaluation ..... 35

2.8 Summary ..... 36
■ Chapter 3: Legacy ..... 39
3.1 Real Mode ..... 39
3.2 Protected Mode ..... 40
3.3 Minimal Segmentation in Long Mode ..... 44
3.4 Accessing Parts of Registers ..... 45
3.4.1 An Unexpected Behavior ..... 45
3.4.2 CISC and RISC ..... 45
3.4.3 Explanation ..... 46

3.5 Summary ..... 46


■ Chapter 4: Virtual Memory ..... 47
4.1 Caching ..... 47
4.2 Motivation ..... 47
4.3 Address Spaces ..... 49
4.5 Example: Accessing a Forbidden Address ..... 50
4.6 Efficiency ..... 52
4.7 Implementation ..... 52
4.7.1 Virtual Address Structure ..... 53
4.7.2 Address Translation in Depth ..... 53
4.7.3 Page Sizes ..... 56

4.8 Memory Mapping ..... 57
4.9.1 Mnemonic Names for Constants ..... 57
4.9.2 Complete Example ..... 58

4.10 Summary ..... 60
■ Chapter 5: Compilation Pipeline ..... 63
5.1 Preprocessor ..... 64
5.1.1 Simple Substitutions ..... 64
5.1.2 Substitutions with Arguments ..... 65
5.1.3 Simple Conditional Substitution ..... 66
5.1.4 Conditioning on Definition ..... 67
5.1.5 Conditioning on Text Identity ..... 67
5.1.6 Conditioning on Argument Type ..... 68
5.1.7 Evaluation Order: Define, xdefine, Assign ..... 69
5.1.8 Repetition ..... 70
5.1.9 Example: Computing Prime Numbers ..... 71
5.1.10 Labels Inside Macros ..... 72
5.1.11 Conclusion ..... 73



5.2 Translation ..... 74
5.3 Linking ..... 74
5.3.1 Executable and Linkable Format ..... 76
5.3.3 Executable Object Files ..... 80
5.3.4 Dynamic Libraries ..... 81
5.3.5 Loader ..... 85

5.4 Assignment: Dictionary ..... 89
■ Chapter 6: Interrupts and System Calls ..... 91
6.1 Input and Output ..... 91
6.1.1 TR Register and Task State Segment ..... 92

6.2 Interrupts ..... 94
6.3 System Calls ..... 97
6.3.1 Model-Specific Registers ..... 97
6.3.2 syscall and sysret ..... 97

6.4 Summary ..... 99
■ Chapter 7: Models of Computation ..... 101
7.1 Finite State Machines ..... 101
7.1.1 Definition ..... 101
7.1.2 Example: Bit Parity ..... 103
7.1.3 Implementation in Assembly Language ..... 103
7.1.4 Practical Value ..... 105
7.1.5 Regular Expressions ..... 106

7.2 Forth Machine ..... 109
7.2.2 Tracing an Exemplary Forth Program ..... 112
7.2.4 How Words Are Implemented ..... 112
7.2.5 Compiler ..... 117


7.3 Assignment: Forth Compiler and Interpreter ..... 118
7.3.1 Static Dictionary, Interpreter ..... 118
7.3.2 Compilation ..... 121
7.3.3 Forth with Bootstrap ..... 123

7.4 Summary ..... 125

■ Part II: The C Programming Language ..... 127
■ Chapter 8: Basics ..... 129
8.1 Introduction ..... 129
8.2 Program Structure ..... 130
8.2.1 Data Types ..... 132

8.3 Control Flow ..... 133
8.3.1 if
8.3.4 goto ..... 136
8.3.5 switch ..... 137
8.3.6 Example: Divisor ..... 138
8.3.7 Example: Is It a Fibonacci Number? ..... 138

8.4 Statements and Expressions ..... 139
8.4.1 Statement Types
8.4.2 Building Expressions ..... 141

8.5 Functions ..... 142
8.6 Preprocessor ..... 144
8.7 Summary ..... 146
■ Chapter 9: Type System ..... 147
9.1 Basic Type System ..... 147
9.1.1 Numeric Types ..... 147
9.1.2 Type Casting ..... 149
9.1.3 Boolean Type ..... 150
9.1.4 Implicit Conversions ..... 150


9.1.5 Pointers ..... 151
9.1.6 Arrays ..... 153
9.1.7 Arrays as Function Arguments ..... 153
9.1.8 Designated Initializers in Arrays ..... 154
9.1.9 Type Aliases ..... 155
9.1.10 The Main Function Revisited ..... 156
9.1.11 The sizeof Operator ..... 157
9.1.12 Const Types ..... 158
9.1.13 Strings ..... 160
9.1.14 Functional Types
9.1.16 Assignment: Scalar Product ..... 166
9.1.17 Assignment: Prime Number Checker ..... 167

9.2 Tagged Types ..... 167
9.2.1 Structures ..... 167
9.2.2 Unions ..... 169
9.2.3 Anonymous Structures and Unions ..... 170
9.2.4 Enumerations ..... 171

9.3 Data Types in Programming Languages ..... 172
9.3.2 Polymorphism ..... 174

9.4 Polymorphism in C ..... 175
9.4.1 Parametric Polymorphism ..... 175
9.4.2 Inclusion ..... 177
9.4.3 Overloading ..... 178
9.4.4 Restrictions ..... 179

9.5 Summary ..... 179
■ Chapter 10: Code Structure ..... 181
10.1 Declarations and Definitions ..... 181
10.1.1 Function Declarations ..... 182
10.1.2 Structure Declarations ..... 183


10.2 Accessing Code from Other Files ..... 184
10.2.2 Data in Other Files ..... 185
10.2.3 Header Files ..... 187

10.3 Standard Library ..... 190
10.4.1 Include Guard ..... 192
10.4.2 Why Is the Preprocessor Bad? ..... 194

10.5 Example: Sum of a Dynamic Array
10.5.1 Dynamic Memory Allocation ..... 195
10.5.2 Example ..... 195

10.6 Assignment: Linked List ..... 197
10.6.1 Assignment ..... 197

10.7 The static Keyword ..... 200
■ Chapter 11: Memory ..... 201
11.1 Pointers Revisited ..... 201
11.1.1 Why Do We Need Pointers? ..... 201
11.1.2 Pointer Arithmetic ..... 202
11.1.3 The void* Type ..... 203
11.1.4 NULL ..... 203
11.1.5 A Word on ptrdiff_t ..... 204
11.1.6 Function Pointers ..... 205

11.2 Memory Model ..... 206
11.2.1 Memory Allocation ..... 207

11.3 Arrays and Pointers ..... 209
11.3.1 Syntax Details ..... 210

11.4 String Literals ..... 211
11.4.1 String Interning ..... 213


11.5 Data Models ..... 215
11.7 Assignment: Higher-Order Functions and Lists ..... 217
11.7.1 Higher-Order Functions ..... 217
11.7.2 Assignment ..... 218

11.8 Summary ..... 220
■ Chapter 12: Syntax, Semantics, and Pragmatics ..... 221
12.2 Syntax and Formal Grammars ..... 222
12.2.1 Example: Natural Numbers ..... 223
12.2.2 Example: Simple Arithmetic ..... 224
12.2.3 Recursive Descent ..... 224
12.2.4 Example: Arithmetic with Priorities ..... 227
12.2.5 Example: A Simple Imperative Language ..... 229
12.2.6 Chomsky Hierarchy ..... 229
12.2.7 Abstract Syntax Tree ..... 230
12.2.8 Lexical Analysis ..... 231
12.2.9 Summary on Parsing ..... 231

12.3 Semantics
  12.3.1 Undefined Behavior 232
  12.3.2 Unspecified Behavior 233
  12.3.3 Implementation-Defined Behavior 234
  12.3.4 Sequence Points 234

12.4 Pragmatics
  12.4.1 Alignment 235
  12.4.2 Data Structure Padding 235

12.5 Alignment in C11 239


■Chapter ■ 13: Good Code Practices 241

13.1 Decision Making 241
13.2 Code Elements 242
  13.2.1 General Naming
  13.2.2 File Structure 243
  13.2.3 Types 243
  13.2.4 Variables 244
  13.2.5 On Global Variables
  13.2.6 Functions 246

13.3 Files and Documentation 248
13.5 Immutability 251
13.6 Assertions 251
13.7 Error Handling 252
13.8 On Memory Allocation 254
13.9 On Flexibility 255
13.10 Assignment: Image Rotation 256
  13.10.1 BMP File Format 256
  13.10.2 Architecture 258

13.11 Assignment: Custom Memory Allocator

■Part ■ III: Between C and Assembly 263

■Chapter ■ 14: Translation Details 265

14.1 Function Calling Sequence 265
  14.1.1 XMM Registers 265
  14.1.2 Calling Convention 266
  14.1.3 Example: Simple Function and Its Stack 268
  14.1.4 Red Zone 271
  14.1.5 Variable Number of Arguments 271
  14.1.6 vprintf and Friends 273


14.2 volatile 273
  14.2.1 Lazy Memory Allocation 274

14.3 Nonlocal Jumps: setjmp 276
  14.3.1 Volatile and setjmp 277

14.4 inline 281
14.6 Strict Aliasing 283
14.7 Security Issues 284
  14.7.1 Stack Buffer Overrun 284
  14.7.2 return-to-libc 285
  14.7.3 Format Output Vulnerabilities 285

14.8 Protection Mechanisms
  14.8.1 Security Cookie 287
  14.8.2 Address Space Layout Randomization 288
  14.8.3 DEP 288

14.9 Summary

■Chapter ■ 15: Shared Objects and Code Models 291

15.1 Dynamic Loading 291
15.2 Relocations and PIC 293
15.3 Example: Dynamic Library in C 293
15.4 GOT and PLT 294
  15.4.1 Accessing External Variables 294
  15.4.2 Calling External Functions 297
  15.4.3 PLT Example 299

15.5 Preloading 302
15.7 Examples 303
  15.7.2 On Various Dynamic Linkers 305


  15.7.3 Accessing an External Variable 306
  15.7.4 Complete Assembly Example 307
  15.7.5 Mixing C and Assembly 308

15.8 Which Objects Are Linked? 310
15.9 Optimizations 313
15.10 Code Models 315
  15.10.1 Small Code Model (no PIC) 317
  15.10.2 Large Code Model (no PIC) 318
  15.10.3 Medium Code Model (no PIC) 318
  15.10.4 Small PIC Code Model
  15.10.5 Large PIC Code Model 320
  15.10.6 Medium PIC Code Model 322

15.11 Summary 324

■Chapter ■ 16: Performance 327

16.1 Optimizations 327
  16.1.1 Myth About Fast Languages 327
  16.1.2 General Advice 328
  16.1.3 Omit Stack Frame Pointer 329
  16.1.4 Tail Recursion 330
  16.1.5 Common Subexpression Elimination 333
  16.1.6 Constant Propagation 334
  16.1.7 (Named) Return Value Optimization 336
  16.1.8 Influence of Branch Prediction 338
  16.1.9 Influence of Execution Units 338
  16.1.10 Grouping Reads and Writes in Code 340

16.2 Caching 340
  16.2.1 How Do We Use Cache Effectively?
  16.2.2 Prefetching 341
  16.2.3 Example: Binary Search with Prefetching 342
  16.2.4 Bypassing Cache 345
  16.2.5 Example: Matrix Initialization 346


16.3 SIMD Instruction Class 349
16.4 SSE and AVX Extensions
  16.4.1 Assignment: Sepia Filter 351

16.5 Summary 354

■Chapter ■ 17: Multithreading 357

17.1 Processes and Threads 357
17.2 What Makes Multithreading Hard? 358
17.3 Execution Order 358
17.4 Strong and Weak Memory Models 359
17.5 Reordering Example 360
17.6 What Is Volatile and What Is Not 362
17.7 Memory Barriers 363
17.8 Introduction to pthreads 365
  17.8.1 When to Use Multithreading 365
  17.8.2 Creating Threads 366
  17.8.3 Managing Threads 369
  17.8.4 Example: Distributed Factorization 370
  17.8.5 Mutexes 374
  17.8.6 Deadlocks 377
  17.8.7 Livelocks 378
  17.8.8 Condition Variables 379
  17.8.9 Spinlocks 381

17.9 Semaphores 382
17.10 How Strong Is Intel 64? 385
17.11 What Is Lock-Free Programming? 388
17.12 C11 Memory Model 390
  17.12.1 Overview 390
  17.12.2 Atomics 390
  17.12.3 Memory Orderings in C11 392
  17.12.4 Operations 392

17.13 Summary 394


■Part ■ IV: Appendices 397

■Chapter ■ 18: Appendix A. Using gdb 399

■Chapter ■ 19: Appendix B. Using Make 409

19.1 Simple Makefile 409
19.2 Adding Variables 410
19.3 Automatic Variables 412

■Chapter ■ 20: Appendix C. System Calls 415

20.1 read 415
  20.1.1 Arguments 416

20.2 write 416
  20.2.1 Arguments 416

20.3 open 416
  20.3.1 Arguments 417
  20.3.2 Flags 417

20.4 close 418

20.5 mmap
  20.5.1 Arguments
  20.5.2 Protection Flags 419
  20.5.3 Behavior Flags 419

20.6 munmap 419
  20.6.1 Arguments 419

20.7 exit 420
  20.7.1 Arguments 420

■Chapter ■ 21: Appendix D. Performance Testing Information 421

■Chapter ■ 22: Bibliography 425

Index 429


About the Author

Igor Zhirkov teaches his highly successful "System Programming Languages" course at ITMO University in Saint Petersburg, a six-time winner of the ACM-ICPC International Collegiate Programming Contest World Championship. He studied at Saint Petersburg Academic University and received a master's degree from ITMO University. He is currently doing research on verified refactorings of C programs as part of his PhD thesis, and on the formalization of a Bulk Synchronous Parallelism library in C, at IMT Atlantique in Nantes, France. His main interests are low-level programming, programming language theory, and type theory. His other interests include playing the piano, calligraphy, art, and the philosophy of science.

xix

About the Technical Reviewer

Ivan Loginov is a researcher and lecturer at ITMO University, St. Petersburg, Russia (University of Information Technologies, Mechanics and Optics), where he teaches the course "Introduction to Programming Languages" to undergraduate computer science students. He received his master's degree from ITMO University. His research focuses on compiler theory, language workbenches, and parallel and distributed programming, as well as on new teaching techniques and their application in IT (information technology). He is currently writing his PhD thesis on a cloud-based modeling toolkit for system dynamics. His hobbies include playing the trumpet and reading classical (Russian) literature.


Acknowledgments

I was lucky to meet a large number of very talented and extremely dedicated people, who helped me and often guided me through areas of knowledge I never imagined. Thanks to Vladimir Nekrasov, my dear math teacher, for his course and his influence on me, which allowed me to think better and more logically. I am grateful to Andrew Dergachev, who entrusted me with creating and teaching my course and helped me a lot over the years, to Boris Timchenko, Arkady Kluchev, Ivan Loginov (who also kindly agreed to be the technical reviewer for this book), and to all my colleagues at ITMO University who helped shape this course in one way or another. I am grateful to all my students who have given me feedback or even helped me with teaching. You are the reason I am doing this. Several students helped revise the draft of this book; the most useful comments came from Dmitry Khalansky and Valery Kireev. For me, my years at St. Petersburg Academic University are easily the best of my life. I have never had so many opportunities to study with world-class experts working at top companies, alongside other students who are much smarter than I am. I want to express my deepest gratitude to Alexander Omelchenko, Alexander Kulikov, Andrey Ivanov, and all who contribute to the quality of computer science education in Russia. Thanks also to Dmitry Boulytchev, Andrey Breslav, and Sergey Sinchuk at JetBrains, my supervisors, who taught me a lot. I am also very grateful to my French colleagues: Ali Ed-Dbali, Frédéric Loulergue, Rémi Douence, and Julien Cohen. I also want to thank Sergei Gorlatch and Tim Humernbrum for providing much-needed feedback on Chapter 17, which helped me create a much more coherent and understandable version. Special thanks to Dmitry Shubin for his most helpful impact in correcting the imperfections of this book.
I am very grateful to my friend Alexey Velikiy and his agency CorpGlory.com, which focuses on data visualization and infographics and created the best illustrations in this book. Behind every little success of mine is an endless amount of support from my family and friends. I would not have achieved anything without you. Last but not least, thanks to the team at Apress, including Robert Hutchinson, Rita Fernando, Laura Berendson, and Susan McDermott, for trusting me and this project and for doing everything possible to make this book a reality.


Introduction

This book is intended to help you develop a consistent view of low-level programming. We want to enable a careful reader to:

• Write freely in assembly language.
• Understand the Intel 64 programming model.
• Write robust, maintainable code in C11.
• Understand the compilation process and decipher assembly listings.
• Debug errors in compiled assembly code.
• Use appropriate models of computation to significantly reduce program complexity.
• Write performance-critical code.

There are two types of technical books: those used for reference and those used for learning. This book is undoubtedly of the second type. It is quite dense on purpose, and to digest the information successfully we strongly recommend reading it sequentially. To memorize new information quickly, you should try to connect it with information you are already familiar with. That is why we have tried, wherever possible, to base our explanation of each topic on the information you received from previous topics. This book is written for programming students, intermediate to advanced programmers, and low-level programming enthusiasts. The prerequisites are a basic understanding of the binary and hexadecimal systems and a basic knowledge of Unix commands.

■■Questions and Answers  Throughout this book you will find many questions. Most of them are meant to make you think again about what you have just learned, but some encourage you to do additional research by pointing out relevant keywords. We provide the answers to these questions on our GitHub page, which also hosts all the listings and starter code for assignments, updates, and more. See the book page on the Apress website for additional information: http://www.apress.com/us/book/9781484224021. There you will also find several preconfigured virtual machines with Debian Linux installed, with and without a graphical user interface (GUI), allowing you to start practicing right away without wasting time configuring your system. You can find more information in section 2.1. We start with the very simple core ideas of what a computer is, explaining the concepts of computer model and computer architecture. We expand the core model with extensions until it is adequate to describe a modern processor as seen by a programmer. From Chapter 2 onward, we start programming in real assembly language for Intel 64 without resorting to the older 16-bit architectures, which are often taught for historical reasons. This allows you to see the interactions between applications and the operating


system through the system call interface, as well as architecture-specific details such as endianness. After a brief overview of the legacy architecture's features, some of which are still in use, we study virtual memory in great detail and illustrate its use with the help of procfs and examples of using mmap system calls in assembly. We then dive into the compilation process, reviewing preprocessing and static and dynamic linking. After exploring the interrupt and system call mechanisms in more detail, we end the first part with a chapter on different models of computation, examining examples of finite state machines and stack machines and implementing a fully functional Forth language compiler in assembly. The second part is dedicated to the C language. We start from a general description of the language, building the basic understanding of its model of computation necessary to start writing programs. In the next chapter, we study the C type system and illustrate different kinds of types, ending with a discussion of polymorphism and providing exemplary implementations of different kinds of polymorphism in C. We then study ways to structure a program properly by splitting it across multiple files, and we also see the effect this has on the linking process. The next chapter is dedicated to memory management and input and output. After that, we elaborate the three facets of every language: syntax, semantics, and pragmatics, concentrating on the first and third. We see how language statements are transformed into abstract syntax trees, the difference between undefined and unspecified behavior in C, and the effect of language pragmatics on the assembly code produced by the compiler. At the end of Part Two, we dedicate a chapter to good coding practices to give readers an idea of how to write code depending on their specific requirements. The assignment sequence for this part ends with rotating a bitmap file and a custom memory allocator. The final part is a bridge between the previous two.
It dives into translation details, such as calling conventions and stack frames, and advanced C language features that require some understanding of assembly, such as the restrict and volatile keywords. We provide an overview of several classic low-level bugs, such as stack buffer overflow, which can be exploited to induce unwanted behavior in a program. The next chapter discusses shared objects in great detail and studies them at the assembly level, providing minimal working examples of shared libraries written in C and assembly. Next, we discuss the relatively rarely covered topic of code models. The following chapter examines the optimizations that modern compilers are capable of and how this knowledge can be used to produce fast, readable code. We also provide an overview of performance-boosting techniques, such as using specialized assembly instructions and optimizing cache usage. This is followed by an assignment in which you will implement a sepia filter for an image using specialized SSE instructions and measure its performance. The last chapter introduces multithreading through the use of the pthreads library, along with memory models and reordering, which anyone doing multithreaded programming should be aware of, and explains the need for memory barriers. The appendices include short tutorials on gdb (a debugger) and make (an automated build system), a table of the most commonly used system calls for reference, and system information to help you reproduce the benchmarks provided throughout the book. They should be read as needed, but we recommend getting used to gdb as soon as you start programming in assembly in Chapter 2. Most of the illustrations were produced using the VSVG library, intended for producing complex interactive vector graphics, written by Alexey Velikiy (http://www.corpglory.com). The library sources and book illustrations are available on the VSVG GitHub page: https://github.com/corpglory/vsvg. We hope you find this book useful, and we wish you happy reading!


PART I

Assembly Language and Computer Architecture

CHAPTER 1

Basic Computer Architecture

This chapter will give you a general understanding of the fundamentals of computer operation. We will describe a basic model of computation, list its extensions, and take a closer look at two of them, namely, registers and the hardware stack. This will prepare you to start programming in assembly in the next chapter.

1.1 Core Architecture

1.1.1 Model of Computation

What does a programmer do? A first guess would probably be "construct and implement algorithms." So we have an idea, then we code it; that is the common way of thinking. Can we build an algorithm to describe some daily routine, like walking or shopping? The question does not seem particularly difficult, and many people will happily provide their solutions. However, all these solutions will be fundamentally different. One will operate with actions like "open the door" or "take the key"; another will prefer "leave home," omitting details; a third will be exhaustive, providing a detailed description of the movements of your hands and legs, or even describing your muscle contraction patterns. The reason these answers are so different is the incompleteness of the original question. All ideas (including algorithms) need a way to be expressed. To describe a new notion, we use other, simpler notions. We also want to avoid vicious circles, so the explanation will follow the shape of a pyramid. Each explanation level will grow horizontally. We cannot build this pyramid infinitely, because an explanation has to be finite, so we stop at the level of basic, primitive notions, which we deliberately choose not to expand further. Choosing the basics is thus a fundamental requirement for expressing anything. This means that building algorithms is impossible unless we have fixed a set of basic actions that act as building blocks. A model of computation is a set of basic operations and their respective costs. The costs are usually integers and are used to reason about the complexity of algorithms by calculating the combined cost of all their operations. We will not discuss computational complexity in this book. Most models of computation are also abstract machines. This means that they describe a hypothetical computer, whose instructions correspond to the basic operations of the model.
The other type of model, decision trees, is beyond the scope of this book.

1.1.2 Von Neumann Architecture

Now let's imagine we live in the 1930s, when today's computers did not yet exist. People wanted to automate computations somehow, and different researchers came up with different ways to achieve such automation. Common examples are Church's lambda calculus and the Turing machine. These are typical abstract machines, describing imaginary computers.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_1


Chapter 1 ■ Basic Computer Architecture

One type of machine soon became dominant: the von Neumann architecture computer. Computer architecture describes the functionality, organization, and implementation of computer systems. It is a relatively high-level description, compared to a model of computation, which does not omit a single detail. The von Neumann architecture had two crucial advantages: it was robust (in a world where electronic components were highly unstable and short-lived) and easy to program. In short, it is a computer built from a processor and a memory bank, connected to a common bus. The central processing unit (CPU) can execute instructions, which the control unit fetches from memory. The arithmetic logic unit (ALU) performs the necessary computations. Memory also stores data. See Figures 1-1 and 1-2. The main characteristics of this architecture are as follows:

• Memory stores only bits (a bit is a unit of information, a value equal to 0 or 1).
• Memory stores both encoded instructions and data to operate on. There is no way to distinguish data from code: both are effectively bit strings.
• Memory is organized into cells, which are labeled with their respective indices in a natural way (e.g., cell #43 follows cell #42). Indices start at 0. Cell size can vary (John von Neumann thought that every bit should have its own address); modern computers use one byte (eight bits) as the memory cell size, so byte #0 holds the first eight bits of memory, and so on.
• The program consists of instructions, which are fetched one after another. Their execution is sequential unless a special jump instruction is executed.

Figure 1-1. Von Neumann architecture: overview

Assembly language for a chosen processor is a programming language consisting of mnemonics for every possible binary-encoded instruction (machine code). It makes programming in machine code much easier, because the programmer does not need to memorize the binary encodings of instructions, only their names and parameters. Note that instructions can have parameters of different sizes and formats. An architecture does not always define a precise instruction set, unlike a model of computation. A typical modern personal computer has evolved from the early computers with von Neumann architecture, so let us investigate this evolution and see what distinguishes a modern computer from the simple schematic of Figure 1-2.


Figure 1-2. von Neumann architecture—Memory

■■Note  The memory state and the register values fully describe the CPU state (from the programmer's point of view). Understanding an instruction means understanding its effects on memory and registers.

1.2 Evolution

1.2.1 Drawbacks of the von Neumann Architecture

The simple architecture described above has serious drawbacks. First of all, this architecture is not interactive at all. A programmer is limited to manually editing memory and viewing its contents in some way. In the early days of computers this was quite straightforward, because the circuitry was large and bits could literally be flipped by hand.

Furthermore, this architecture does not support multitasking. Imagine that your computer is performing a very slow task (for example, controlling a printer). It is slow because a printer is much slower than even the slowest CPU. The CPU then has to wait for the device to react for roughly 99% of the time, which is a waste of resources (namely, CPU time).

Next, when anyone can execute any kind of instruction, all sorts of unexpected behavior can occur. The purpose of an operating system (OS) is (among other things) to manage resources (such as external devices) so that user applications do not cause chaos by interacting with the same devices concurrently. Because of this, we would like to prohibit all user programs from executing some instructions related to input/output or system management.

Another issue is that memory and CPU performance differ drastically. In the old days, computers were not merely simpler: they were designed as integral entities. Memory, bus, network interfaces—everything was created by the same team of engineers. Each part was specialized for use in that specific model, so the parts were not meant to be interchangeable. Under these circumstances no one tried to create a part capable of outperforming the others, because it could not increase the computer's overall performance. But as architectures became more or less stable, hardware developers started working on different parts of computers independently. Naturally, they tried to improve their performance for marketing purposes. However, not all parts were easy and cheap to speed up.
That is why CPUs soon became much faster than memory. It is possible to speed memory up by choosing another type of underlying circuitry, but it would be much more expensive [12].¹

¹ Consider how often the solutions engineers come up with are dictated by economic reasons rather than technical constraints.


When a system consists of different parts and their performance characteristics differ greatly, the slowest part can become a bottleneck: if it is replaced with a faster analog, the overall performance increases significantly. It was at such moments that the architecture had to be modified heavily.

1.2.2 Intel 64 Architecture

In this book we describe only the Intel 64 architecture.² Intel has been developing its main family of processors since the 1970s. Each model was intended to preserve binary compatibility with the older ones, meaning that even modern processors can execute code written and compiled for older models. This leads to an enormous amount of legacy. Processors can operate in a number of modes: real mode, protected mode, virtual 8086 mode, and so on. If not explicitly stated otherwise, we describe how a CPU functions in the newest mode, long mode.

1.2.3 Architecture Extensions

Intel 64 incorporates multiple extensions of the von Neumann architecture. The most important ones are listed here for a quick overview.

Registers  These are memory cells placed directly on the CPU chip. Circuit-wise they are much faster, but they are also more complicated and costly. Register accesses do not use the bus, and their response time is quite small, usually equal to a few CPU cycles. See section 1.3 “Registers”.

Hardware stack  A stack in general is a data structure. It supports two operations: pushing an element on top of it, and popping the topmost element off it. A hardware stack implements this abstraction on top of memory through special instructions and a register pointing to the last stack element. A stack is used not only in computations but also to store local variables and to implement function calls in programming languages. See section 1.5 “Hardware stack”.

Interrupts  This feature allows changing the program execution order based on events external to the program itself. After a signal (external or internal) is caught, a program's execution is suspended, some registers are saved, and the CPU starts executing a special routine to handle the situation. The following are examples of situations in which an interrupt occurs (and an appropriate piece of code is executed to handle it):

• A signal from an external device.
• Division by zero.
• Invalid instruction (when the CPU fails to recognize an instruction by its binary representation).
• An attempt to execute a privileged instruction in non-privileged mode.

See section 6.2 “Interrupts” for a more detailed description.

Protection rings  A CPU is always in a state corresponding to one of the so-called protection rings. Each ring defines a set of allowed instructions. The zeroth ring allows executing any instruction from the whole CPU instruction set and is thus the most privileged. The third allows only the safest ones.
An attempt to execute a privileged instruction results in an interrupt. Most applications work inside the third ring, which guarantees that they do not modify crucial system data structures (such as the page tables) and do not work with external devices while bypassing the operating system. The other two rings (the first and the second) are intermediate, and modern operating systems do not use them. See section 3.2 “Protected mode” for a more detailed description.

Virtual memory  This is an abstraction over physical memory, which helps distribute it among programs in a safer and more effective way. It also isolates programs from one another.

² Also known as x86_64 and AMD64.


See section 4.2 “Motivation” for a more detailed description.

Some extensions (for example, caches and shadow registers) cannot be accessed directly by the programmer; we will also mention some of them. Table 1-1 summarizes information about the extensions of the von Neumann architecture found in modern computers.

Table 1-1. von Neumann Architecture: Modern Extensions

Problem                                                    Solution
Nothing can be done without querying slow memory           registers, caches
Lack of interactivity                                      interrupts
No support for isolating code in procedures or             hardware stack
saving the context
Multitasking: any program can execute any instruction      protection rings
Multitasking: programs are not isolated from one another   virtual memory

■■Sources of Information  No book can completely cover the processor's instruction set and architecture. Many books attempt to include exhaustive information about the instruction set; it becomes obsolete very quickly, and, moreover, it inflates the book unnecessarily. We will frequently refer you to the Intel® 64 and IA-32 Architectures Software Developer's Manual, available online: see [15]. Get it now! There is no point in copying instruction descriptions from the place where they originally appear; it is much more productive to learn to work with the source. The second volume covers the entire instruction set and has a very useful index. Always use it for information about the instruction set—doing so is not only very good practice but also relies on a very trustworthy source. Note that many educational resources on the Internet devoted to assembly language are badly outdated (few people program in assembly these days) and do not cover 64-bit mode. Instructions that existed in older modes often have updated long-mode counterparts that function differently. This is one of the reasons we strongly discourage using search engines to find instruction descriptions, however tempting it may be.

1.3 Registers

Data exchange between the CPU and memory is a crucial part of computation in a von Neumann computer. Instructions have to be fetched from memory, operands have to be fetched from memory, and some instructions store their results in memory as well. This creates a bottleneck and leads to wasted CPU time while it waits for the data response from the memory chip. To avoid constant waiting, the processor was equipped with its own memory cells, called registers. There are few of them, but they are fast. Programs are usually written in such a way that, most of the time, the pool of memory cells being worked with is small enough. This fact suggests that programs can be written so that the CPU works with registers most of the time.


Registers are based on transistors, while main memory uses capacitors. We could have implemented main memory with transistors and obtained a much faster circuit. There are several reasons why engineers prefer other ways of speeding up computations:

• Registers are more expensive.
• Instructions encode register numbers as part of their opcodes. To address more registers, instructions would have to grow in size.
• Registers add complexity to the circuitry that addresses them. More complex circuits are harder to speed up; it is not easy to make a large register file work at 5 GHz.

At first glance, introducing registers might even slow computers down: if everything has to be pulled into registers before computations are performed and then dumped back into memory, where is the gain? Programs are usually written so that they have a certain property. It is the result of using common programming patterns such as loops, functions, and data reuse, not some law of nature. This property is called locality of reference, and there are two main types of it: temporal and spatial. Temporal locality means that accesses to one address are likely to be close in time. Spatial locality means that after accessing an address X, the next memory access is likely to be close to X (such as X − 16 or X + 28). These properties are not binary: a program can exhibit stronger or weaker locality.

Typical programs use the following pattern: the working data set is small and can be held inside registers. After the data are loaded into registers once, we work with them for some time, and then the results are dumped back into memory. The rest of the data stored in memory is rarely used by the program. When we do need to work with that data, we lose performance because

• we need to fetch the data into registers; and
• if all registers are occupied by data we will still need later, we have to spill some of them, that is, save their contents in temporarily allocated memory cells.

■■Note  A common scenario for an engineer: sacrifice performance in the worst case to improve it in the average case. This works quite often, but it is prohibited when building real-time systems, which impose constraints on the system's worst-case reaction time. Such systems are required to react to events in no more than a fixed amount of time, so sacrificing worst-case performance to improve the other cases is not an option.

1.3.1 General Purpose Registers

Most of the time, the programmer works with general purpose registers. They are interchangeable and can be used in many different commands. These are 64-bit registers named r0, r1, ..., r15. The first eight of them have alternative names that reflect the meaning they carry for certain special instructions. For example, r1 is also called rcx, where c stands for "cycle": the loop instruction uses rcx as a loop counter without accepting it as an explicit operand. Of course, such special register meanings are reflected in the documentation of the corresponding commands (for example, rcx as the counter of the loop instruction). Table 1-2 lists them all; see also Figure 1-3.


■■Note  Unlike the hardware stack, which is implemented on top of main memory, registers are a completely different kind of memory. Hence they do not have addresses, the way cells of main memory do!

The alternative names are in fact more common, for historical reasons. We provide both naming schemes for reference, together with a short semantic description of each register; you do not need to memorize them now.

Table 1-2. 64-bit general purpose registers

Name              Alias  Description
r0                rax    A kind of "accumulator", used in arithmetic instructions. For example, the div instruction, which divides two integers, accepts one operand and implicitly uses rax as the second. After div rcx is executed, a 128-bit number, stored in parts in the two registers rdx and rax, is divided by rcx, and the result is again stored in rax.
r3                rbx    Base register. It was used for base addressing in early processor models.
r1                rcx    Used for cycles (e.g., in loop).
r2                rdx    Stores data during input/output operations.
r4                rsp    Stores the address of the topmost element of the hardware stack. See section 1.5 “Hardware stack”.
r5                rbp    Stack frame base. See section 14.1.2 “Calling convention”.
r6                rsi    Source index in string manipulation commands (such as movsd).
r7                rdi    Destination index in string manipulation commands (such as movsd).
r8, r9, ..., r15  —      These appeared later. They are mostly used to store temporary variables (but are sometimes used implicitly: r11, for instance, receives a copy of the CPU flags when the syscall instruction is executed; see Chapter 6 “Interrupts and System Calls”).

You generally don't want to use the rsp and rbp registers because of their very special meaning (we will see later how they back the stack and the stack frame structure). However, you can perform arithmetic on them directly, which is what makes them general purpose. Table 1-3 shows the same registers, sorted by index, next to their aliases.

Table 1-3. 64-bit general purpose registers—the two naming conventions

r0    r1    r2    r3    r4    r5    r6    r7
rax   rcx   rdx   rbx   rsp   rbp   rsi   rdi

It is possible to address a part of a register. For each register, you can address its lowest 32 bits, its lowest 16 bits, or its lowest 8 bits. When the names r0, ..., r15 are used, an appropriate suffix is appended to the register name:

• d for doubleword: the lowest 32 bits;
• w for word: the lowest 16 bits;
• b for byte: the lowest 8 bits.

9

Chapter 1 ■ Basic Computer Architecture

For example,

• r7b is the lowest byte of register r7;
• r3w consists of the two lowest bytes of r3; and
• r0d consists of the four lowest bytes of r0.

The alternative names also allow addressing smaller register parts. Figure 1-4 shows the decomposition of the wide general purpose registers into smaller ones. The naming convention for accessing parts of rax, rbx, rcx, and rdx follows the same pattern; only the middle letter changes. The other four registers do not allow their second lowest bytes to be addressed the way rax does via the name ah. The names of the lowest bytes of rsi, rdi, rsp, and rbp differ slightly:

• The lowest bytes of rsi and rdi are sil and dil (see Figure 1-5).
• The lowest bytes of rsp and rbp are spl and bpl (see Figure 1-6).

In practice, the names r0–r7 are rarely seen. Typically, programmers stick to the alternative names for the first eight general purpose registers. This is done for both legacy and semantic reasons: rsp conveys much more information than r4. The other eight (r8–r15) can only be named using the index convention.

■■Inconsistency in writes  All reads from the smaller registers act in an obvious way. Writes to the 32-bit parts, however, zero the upper 32 bits of the full register. For example, zeroing eax zeroes the whole of rax, and storing -1 in eax fills the upper 32 bits of rax with zeros, making rax equal to 0x00000000ffffffff. Other writes (e.g., to the 16-bit parts) act as expected: they do not affect the remaining bits. See section 3.4.2 “CISC and RISC” for an explanation.


1.3.2 Other Registers

The rest of the registers have special meanings. Some registers have system-wide importance and thus cannot be changed except by the operating system.

Figure 1-3. Intel 64: General Purpose Registers


A programmer has access to the instruction pointer register, rip. It is a 64-bit register that always stores the address of the next instruction to be executed. Branching instructions (e.g., jmp) actually modify it. Thus, every time any instruction is being executed, rip stores the address of the next instruction to be executed.

■■Note  Instructions can have different sizes!

Another handy register is called rflags. It stores flags, which reflect the current state of the program—for example, what the result of the last arithmetic instruction was: was it negative, did an overflow occur, and so on. Its smaller parts are called eflags (32 bits) and flags (16 bits).

■■Question 1  It's time to do some preliminary research based on the documentation [15]. See section 3.4.3 of the first volume for information about the rflags register. What is the meaning of the CF, AF, ZF, OF, and SF flags? What is the difference between OF and CF?

Figure 1-4. rax decomposition

In addition to these main registers, there are also registers used by instructions that work with floating-point numbers, as well as special parallelized instructions capable of performing similar actions on several pairs of operands at a time. The latter are often used for multimedia purposes (they help speed up multimedia decoding algorithms). The corresponding registers are 128 bits wide and are named xmm0–xmm15. We will talk about them later. Some registers appeared as nonstandard extensions but were standardized shortly afterward. These are called model-specific registers. See section 6.3.1 “Model-Specific Registers” for more details.

1.3.3 System Registers

Some registers are designed specifically to be used by the operating system. They do not hold values used in computations. Instead, they store information required by system-wide data structures. Their role is thus to support a framework born from the symbiosis of the operating system and the CPU. All applications run inside this framework. The framework ensures that applications are well isolated from the system itself and from one another; it also manages resources in a way that is more or less transparent to the programmer. It is of the utmost importance that these registers be inaccessible to the applications themselves (at the very least, applications must not be able to modify them). This is the purpose of privileged mode (see section 3.2). We list some of these registers here; their meaning will be explained in detail later.

• cr0, cr4 store flags related to different processor modes and virtual memory;
• cr2, cr3 are used to support virtual memory (see sections 4.2 “Motivation” and 4.7.1 “Virtual address structure”);


• cr8 (alias tpr) is used to fine-tune the interrupt mechanism (see section 6.2 “Interrupts”).
• efer is another flag register used to control processor modes and extensions (e.g., long mode and system call handling).
• idtr stores the address of the interrupt descriptor table (see section 6.2 “Interrupts”).
• gdtr and ldtr store the addresses of the descriptor tables (see section 3.2 “Protected mode”).
• cs, ds, ss, es, gs, fs are called segment registers. The segmentation mechanism they once provided is now considered legacy, but a part of it is still used to implement privileged mode. See section 3.2 “Protected mode”.

Figure 1-5. Decomposition of rsi and rdi

Figure 1-6. Decomposition of rsp and rbp


1.4 Protection Rings

Protection rings are one of the mechanisms designed to limit what applications can do, for security and robustness reasons. They were invented for Multics OS, a direct predecessor of Unix. Each ring corresponds to a certain privilege level. Each instruction type is tied to one or more privilege levels and is not executable at the others. The current privilege level is stored somewhere (for example, in a special register). Intel 64 has four privilege levels, of which only two are used in practice: ring-0 (the most privileged) and ring-3 (the least privileged). The intermediate rings were intended for operating system drivers and services, but popular operating systems did not adopt this approach. In long mode, the current protection ring number is stored in the two lowest bits of the cs register (and duplicated in those of ss). It can only change when an interrupt is handled or a system call is performed. So an application cannot execute arbitrary code with an elevated privilege level: it can only call an interrupt handler or perform a system call. See Chapter 3 “Legacy” for more information.

1.5 Hardware Stack

Speaking of data structures in general, a stack is a container with two operations: a new element can be placed on top of the stack (push), and the top element can be taken off the stack (pop). There is hardware support for such a data structure. This does not mean that a separate stack memory exists: the stack is just a kind of emulation built from two machine instructions (push and pop) and a register (rsp). The rsp register holds the address of the topmost element of the stack. The instructions work as follows:

• push argument
  1. Depending on the argument size (2, 4, and 8 bytes are allowed), the value of rsp is decreased by 2, 4, or 8.
  2. The argument is stored in memory at the address taken from the modified rsp.

• pop argument
  1. The topmost stack element is copied into the register or memory location given as the argument.
  2. rsp is increased by the size of the argument.

The augmented architecture is depicted in Figure 1-7.


Figure 1-7. Intel 64: Registers and Stack

The hardware stack is most useful for implementing function calls in higher-level languages. When a function A calls another function B, it uses the stack to save the context of its computations, to return to it after B terminates. Here are some key facts about the hardware stack, most of which follow from its description:

1. There is no such thing as an empty stack, even if we performed zero pushes. A pop can still be executed; it will most likely return a garbage "topmost" element.
2. The stack grows toward address zero.
3. Operands of almost all types are considered signed integers and thus can be expanded with a sign bit. For example, pushing the one-byte argument 0xb9 will result in one of the following units of data being placed on the stack: 0xffb9, 0xffffffb9, or 0xffffffffffffffb9. By default, push uses an 8-byte operand size. Hence, the instruction push -1 stores 0xffffffffffffffff on the stack.
4. Most architectures that support a stack use the same principle: its top is defined by some register. What differs, however, is the meaning of the stored address. On some architectures it is the address of the next element, where the next write will occur. On others, it is the address of the last element already pushed onto the stack.


■■Working with Intel docs: How to read instruction descriptions  Open the second volume of [15]. Find the page corresponding to the push instruction. It starts with a table. For our purposes, we will only investigate the OPCODE, INSTRUCTION, 64-BIT MODE, and DESCRIPTION columns.

The OPCODE field defines the machine encoding of an instruction (its opcode). As you can see, there are several variants, and each variant corresponds to its own DESCRIPTION. This means that sometimes not only the operands vary but the opcodes themselves as well. INSTRUCTION describes the instruction mnemonics and the allowed operand types. Here r stands for any general purpose register, m stands for a memory location, and imm stands for an immediate value (e.g., an integer constant such as 42 or 1337). A number defines the operand size. If only specific registers are allowed, they are named. For example:

• push r/m16—push a 16-bit general purpose register or a 16-bit value taken from memory onto the stack.
• push CS—push the cs segment register.

The DESCRIPTION column gives a brief explanation of the instruction's effects. It is often enough to understand and use the instruction.

• Read the additional explanations about push. When is the operand's sign not extended?
• Explain all the effects of the push rsp instruction on memory and registers.

1.6 Summary

In this chapter we took a quick tour of the von Neumann architecture and then started adding features to the model to make it more adequate as a description of modern processors. So far we have taken a closer look at registers and the hardware stack. The next step is to start programming in assembly, and that is what the next chapter is dedicated to. We will look at some sample programs, pick up several new architectural features along the way (such as endianness and addressing modes), and write a simple input/output library for *nix to ease user interaction.

■■Question 2  What are the key principles of the von Neumann architecture?
■■Question 3  What are registers?
■■Question 4  What is the hardware stack?
■■Question 5  What are interrupts?
■■Question 6  What are the main problems that the modern extensions of the von Neumann model are trying to solve?
■■Question 7  What are the main general purpose registers of Intel 64?
■■Question 8  What is the purpose of the stack pointer?
■■Question 9  Can the stack be empty?
■■Question 10  Can we count the elements of a stack?

Chapter 2

Assembly Language

In this chapter we will start practicing assembly language by gradually writing more complex programs for Linux. We will look at some architectural details that affect the writing of all kinds of programs (for example, endianness). We chose a *nix system for this book because it is much easier to program in assembly for than Windows.

2.1 Setting Up the Environment

It is impossible to learn programming without trying to program, so let's start programming in assembly right away. We use the following setup to complete the assembly and C assignments:

• Debian GNU/Linux 8.0 as the operating system.
• NASM 2.11.05 as the assembly language compiler.
• GCC 4.9.2 as the C language compiler. This exact version is used to produce the assembly listings of C programs. The Clang compiler can also be used.
• GNU Make 4.0 as the build system.
• GDB 7.7.1 as the debugger.
• The text editor of your choice (preferably with syntax highlighting). We advocate the use of Vim.

If you want to set up your own system, install any Linux distribution you like and make sure to install the programs listed above. As far as we know, the Windows Subsystem for Linux is also suitable for all the tasks; you can install it and then install the necessary packages using apt-get. See the official guide at: https://msdn.microsoft.com/en-us/commandline/wsl/install_guide. On the Apress website for this book, http://www.apress.com/us/book/9781484224021, you can find the following:

• Two preconfigured virtual machines with the entire toolchain installed. One has a desktop environment; the other is a minimal system intended to be accessed via SSH (Secure Shell). Installation instructions and other usage information are in the README.txt file inside the downloaded archive.
• A link to the GitHub page containing all the book's listings, answers to questions, and solutions.


Chapter 2 ■ Assembly Language

2.1.1 Working with Code Examples

Throughout this chapter you will see several code examples. Compile them, and if you have trouble understanding their logic, try stepping through them using gdb; it is of great help when studying code. See Appendix A for a quick tutorial on gdb. Appendix D provides more information about the setup used for performance testing.

2.2 Writing “Hello, world!” 2.2.1 Basic Input and Output Unix ideology postulates that “everything is a file.” A file, broadly speaking, is anything that looks like a stream of bytes. Through files you can abstract things such as • accessing data on a hard drive or SSD; • data exchange between programs; and • interaction with external devices. We will continue the tradition of starting with a simple "Hello, world!" program: it displays a welcome message on the screen and exits. However, such a program must display characters on the screen, which cannot be done directly unless the program runs on bare hardware, without an operating system supervising it. The purpose of an operating system is, among other things, to abstract and manage resources, and the display is certainly one of them. It provides a set of routines to handle communication with external devices, other programs, file systems, and so on. Normally, a program cannot bypass the operating system and interact directly with the resources it controls. It is limited to system calls, which are routines an operating system provides to user applications. Unix identifies a file by its descriptor as soon as a program opens it. A descriptor is nothing more than an integer value (like 42 or 999). A file is opened explicitly by invoking the open system call; however, three important files are opened as soon as a program starts and therefore need not be managed manually. These are stdin, stdout, and stderr. Their descriptors are 0, 1, and 2, respectively. stdin is used to handle input, stdout to handle output, and stderr to emit information about the program's execution process, but not its results (for example, errors and diagnostics). By default, keyboard input is bound to stdin and terminal output is bound to stdout. It means "Hello, world!" must be written to stdout, so we need to invoke the write system call.
It writes a given number of bytes from memory, starting at a given address, to the file with a given descriptor (in our case, 1). The bytes encode string characters using a predefined table (the ASCII table). Each entry is one character; its index in the table matches its code, in the range 0 to 255. See Listing 2-1 for our first complete example of an assembly program.
Listing 2-1. hello.asm
global _start

section .data
message: db 'hello, world!', 10

section .text
_start:
mov rax, 1       ; system call number should be stored in rax
mov rdi, 1       ; argument #1 in rdi: where to write (descriptor)?


Chapter 2 ■ Assembly Language

mov rsi, message ; argument #2 in rsi: where does the string start?
mov rdx, 14      ; argument #3 in rdx: how many bytes to write?
syscall          ; this instruction invokes a system call
This program invokes the write system call with the correct arguments on lines 6-9. That is really all it does. The next sections explain this sample program in more detail.

2.2.2 Program Structure As we remember from the description of the von Neumann machine, there is only one memory, for both code and data; they are indistinguishable. However, a programmer wants to separate them. An assembly program is usually divided into sections. Each section has its purpose: for example, .text contains instructions and .data holds global variables (data available at every moment of program execution). You can switch between sections; in the resulting program, all data belonging to each section will be gathered in one place. To get rid of numeric address values, programmers use labels. They are just human-readable names for addresses. A label can precede any command and is usually separated from it by a colon. There is one label in this program on line 5: _start. The notion of a variable is typical of higher-level languages. In assembly language, the notions of variables and procedures are in fact quite subtle, so it is more convenient to talk about labels (that is, addresses). An assembly program can be split into several files. One of them must contain the _start label. It is the entry point; it marks the first instruction to be executed. This label must be declared global (see line 1); the significance of this will become apparent later. Comments begin with a semicolon and last until the end of the line. Assembly language consists of commands, which map directly to machine code. However, not all language constructs are commands. Others control the translation process and are usually called directives.1 In the “Hello, world!” example, there are three directives: global, section, and db.

■■Note  Assembly language is generally case insensitive, but label names are not! mov, moV, and Mov are all the same, but global _start and global _START are not! Section names are case sensitive too: section .DATA and section .data differ!

The db directive is used to create byte data. Typically, data is defined using one of these directives, which differ in the data format: • db: bytes; • dw: so-called words, equal to 2 bytes each; • dd: double words, equal to 4 bytes each; and • dq: quad words, equal to 8 bytes each. Let's look at the example in Listing 2-2.
Listing 2-2. data_decl.asm
section .data
example1: db 5, 16, 8, 4, 2, 1
example2: times 999 db 42
example3: dw 999

1 The NASM manual also uses the name "pseudoinstruction" for a specific subset of directives.



times n cmd is a directive that repeats cmd n times in the program code, as if you had copy-pasted it n times. It also works with central processing unit (CPU) instructions. Note that you can create data in any section, including .text. As we said earlier, to a CPU, data and instructions are all the same, and the CPU will try to interpret data as encoded instructions when asked to. These directives allow you to define multiple data objects one after another, as in Listing 2-3, where a string of characters is followed by a single byte equal to 10.
Listing 2-3. hello.asm
message: db 'hello, world!', 10
Letters, digits, and other characters are encoded in ASCII. Programmers agreed on a table where each character is assigned a unique number: its ASCII code. Starting at the address marked by the label message, we store the ASCII codes of all the characters of the string "hello, world!" followed by a byte equal to 10. Why 10? By convention, to start a new line we emit a special character whose code is 10.

■■Terminological chaos  It is quite common to refer to the computer's native integer format as a machine word. Since we are programming a 64-bit computer, where addresses are 64-bit and general purpose registers are 64-bit, it is convenient to consider the machine word size to be 64 bits, or 8 bytes. In assembly programming for the Intel architecture, however, the term word describes a 16-bit data entry, because on older machines that was exactly the machine word. Unfortunately, for legacy reasons, it is still used the old way. That is why 32-bit data is called a double word and 64-bit data a quad word.

2.2.3 Basic Instructions The mov instruction is used to write a value into a register or memory. The value can be taken from another register or from memory, or it can be an immediate one. However, 1. mov cannot copy data from memory to memory; and 2. the source and destination operands must have the same size. The syscall instruction is used to perform system calls on *nix systems. I/O operations depend on hardware (which may also be used by multiple programs at the same time), so programmers cannot control it directly, bypassing the operating system. Each system call has a unique number. To perform one, 1. The rax register must hold the system call's number; 2. The following registers must hold its arguments: rdi, rsi, rdx, r10, r8, and r9. A system call cannot accept more than six arguments; 3. Execute the syscall instruction. It does not matter in which order the registers are initialized. Note that the syscall instruction changes rcx and r11. We will explain why later. In the "Hello, world!" program we use a simple write system call. It accepts 1. A file descriptor; 2. The address of a buffer; we start taking consecutive bytes to write from there; and 3. The number of bytes to write.



To compile our first program, save the code into hello.asm2 and run these commands in the shell: > nasm -felf64 hello.asm -o hello.o > ld -o hello hello.o > chmod u+x hello The details of the build process, along with its stages, will be discussed in Chapter 5. Let's launch "Hello, world!": > ./hello hello, world! Segmentation fault We clearly got what we wanted. However, the program appears to have caused an error. What did we do wrong? After executing a system call, the program continues its work. We did not write any instructions after syscall, but the following memory cells contain some random values.

■■Note  If you did not put anything at some memory address, it will likely contain some kind of garbage, not zeros or any other sort of valid instructions. A processor has no idea whether these values were meant to encode instructions or not. So, following its nature, it tries to interpret them, because the rip register points at them. These values are very unlikely to encode valid instructions, so an interrupt with code 6 (invalid instruction) will occur.3 So what do we do? We have to use the exit system call, which terminates the program in a correct way, as shown in Listing 2-4.
Listing 2-4. hello_proper_exit.asm
section .data
message: db 'hello, world!', 10

section .text
global _start
_start:
mov rax, 1        ; 'write' system call number
mov rdi, 1        ; stdout descriptor
mov rsi, message  ; string address
mov rdx, 14       ; string length in bytes
syscall

mov rax, 60       ; 'exit' system call number
xor rdi, rdi
syscall
2 Remember, all source code, including listings, can be found at www.apress.com/us/book/9781484224021 and is also stored in the home directory of the preconfigured virtual machine.
3 Even if not, sequential execution will soon bring the processor to the end of the allocated virtual addresses; see section 4.2. In the end, the operating system will terminate the program, because it is unlikely to recover.



■■Question 11  What does the xor rdi, rdi instruction do? ■■Question 12  What is the program's return code? ■■Question 13  What is the first argument of the exit system call?

2.3 Example: Printing Register Contents It's time to try something a little harder. Let's output the value of rax in hexadecimal format, as shown in Listing 2-5.
Listing 2-5. Print rax value: print_rax.asm
section .data
codes: db '0123456789ABCDEF'

section .text
global _start
_start:
    ; number 1122... in hexadecimal format
    mov rax, 0x1122334455667788
    mov rdi, 1
    mov rdx, 1
    mov rcx, 64
    ; Each 4 bits should be output as one hexadecimal digit
    ; Use shift and bitwise AND to isolate them
    ; the result is the offset in the 'codes' array
.loop:
    push rax
    sub rcx, 4
    ; cl is a register, the smallest part of rcx
    ; rax -- eax -- ax -- ah + al
    ; rcx -- ecx -- cx -- ch + cl
    sar rax, cl
    and rax, 0xf
    lea rsi, [codes + rax]
    mov rax, 1
    ; syscall leaves rcx and r11 changed
    push rcx
    syscall
    pop rcx
    pop rax
    ; test can be used for the fastest 'is it a zero?' checks
    ; see docs for the 'test' command
    test rcx, rcx
    jnz .loop



mov rax, 60  ; invoke the 'exit' system call
xor rdi, rdi
syscall
By shifting the value of rax and computing the logical AND with the 0xF mask, we transform the whole number into one of its hexadecimal digits. Each digit is a number from 0 to 15. We use it as an index and add it to the address of the label codes to get the character that represents it. For example, given rax = 0x4A, we will use the indices 0x4 = 4₁₀ and 0xA = 10₁₀.4 The first will give us the character '4', whose code is 0x34. The second will return the character 'a', whose code is 0x61.

■■Question 14  Check that the ASCII codes mentioned in the last example are correct. We can use the hardware stack to save and restore register values; that is what we do around the syscall instruction.

■■Question 15  What is the difference between sar and shr? Check the Intel documentation. ■■Question 16  How do you write numbers in different numeral systems in a way NASM understands? Check the NASM documentation.

■■Note  When a program starts, the values of most registers are not well defined (they can be absolutely random). It is a huge source of errors for beginners, since they tend to assume the registers hold zeros.

2.3.1 Local Labels Note the unusual name of the .loop label: it starts with a dot. This label is local. We can reuse label names without causing name conflicts as long as they are local. The last used dot-less global label is the base for all subsequent local labels (until the next global label occurs). The full name of the .loop label is _start.loop. We can use this name to address it from anywhere in the program, even after other global labels have occurred.

2.3.2 Relative Addressing This example demonstrates how to address memory in a more complex way than by an immediate address. Listing 2-6. Relative addressing: print_rax.asm lea rsi, [codes + rax] Brackets denote indirect addressing; the address is written inside them. • mov rsi, rax copies rax into rsi; • mov rsi, [rax] copies the memory contents (8 sequential bytes) starting at the address stored in rax into rsi. How do we know that exactly 8 bytes should be copied? As we know, mov operands have the same size, and the size of rsi is 8 bytes. Knowing these facts, the assembler deduces that exactly 8 bytes should be taken from memory.

4 The subscript denotes the base of the number system.



The lea and mov instructions differ subtly in meaning. lea means "load effective address." It lets you compute the address of a memory cell and store it somewhere. This is not always trivial, because there are sophisticated addressing modes (as we will see later): for example, the address can be a sum of several operands. Listing 2-7 provides a quick demonstration of what lea and mov do. Listing 2-7. lea_vs_mov.asm ; rsi > nasm -f elf64 -o hello.o hello.asm > ld -o hello hello.o We first use NASM to produce an object file. Its format, elf64, was specified with the -f key. We then use another program, ld (a linker), to produce a ready-to-run file. We will use this file format as an example to show what the linker actually does.

5.3.1 Executable and Linkable Format ELF (Executable and Linkable Format) is a format of object files quite typical for *nix systems. We will stick to its 64-bit version. ELF allows for three types of files. 1. Relocatable object files are the .o files produced by a compiler. Relocation is the process of assigning final addresses to different parts of the program and changing the program code accordingly, so that all references are set up correctly. We are speaking about all kinds of memory accesses by absolute addresses. Relocation is needed, for example, when a program consists of several modules that reference one another. The order in which they will be placed in memory is not yet fixed, so the absolute addresses are not determined. Linkers can combine these files to produce the next type of object file. 2. An executable object file can be loaded into memory and executed right away. It is essentially a structured storage for code, data, and utility information.


Chapter 5 ■ Build Pipeline

3. Shared object files can be loaded when the main program needs them; they are linked with it dynamically. In the Windows world these are known as .dll files; on *nix systems, their names usually end with .so. The purpose of any linker is to create an executable (or shared) object file, given a set of relocatable ones. To do that, a linker must perform the following tasks: • Relocation • Symbol resolution. Each time a symbol (function, variable) is dereferenced, the linker has to modify the object file, filling in the instruction part corresponding to the operand address with the correct value.

5.3.1.1 Structure An ELF file starts with the main header, which stores global metadata. See Listing 5-21 for a typical ELF header. The hello file is the result of compiling the "Hello, world!" program shown in Listing 2-4.
Listing 5-21. hello_elfheader
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4000b0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          552 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         2
  Size of section headers:           64 (bytes)
  Number of section headers:         6
  Section header string table index: 3
An ELF file then provides information about the program, which can be viewed from two points of view: • Linking view, consisting of sections.

It is described by the section table, which can be accessed via readelf -S.

Each section in turn can be: –– Raw data to be loaded into memory. –– Formatted metadata about other sections, used by the loader (e.g., .bss), the linker (e.g., relocation tables), or the debugger (e.g., .line).



Code and data are stored within sections. • Execution view, made up of segments.

It is described by the program header table, which can be studied using readelf -l. We will look at it more closely in section 5.3.5.

Each entry can describe: –– Some kind of information the system needs to execute the program. –– An ELF segment, containing zero or more sections. Segments have a set of permissions (read, write, execute) enforced by virtual memory. Each segment has a starting address and is loaded into a separate memory region consisting of consecutive pages.

After reviewing Listing 5-21, we see that it accurately describes the positions and sizes of the program and section headers. We will start with the linking view, since the linker works mostly with it.

5.3.1.2 Sections in ELF Files Assembly language allows manual control over sections. The NASM section directive corresponds to sections in object files. You have already seen some of them, namely, .text and .data. Following is the list of the most used sections; the complete list can be found in [24]. .text stores machine instructions. .rodata stores read-only data. .data stores initialized global variables. .bss stores readable and writable global variables, initialized to zero. There is no need to dump their contents into an object file, since they are all zeroed out anyway; instead, the total size of the section is stored. An operating system may know faster ways to initialize this memory than zeroing it manually. In assembly, you can put data here by writing resb, resw, and similar directives after the section .bss declaration. .rel.text stores the relocation table for the .text section. It is used to memorize the places where the linker must modify .text after choosing the loading address for this particular object file. .rel.data stores the relocation table for the data referenced in the module. .debug stores a symbol table used to debug the program. If the program was written in C or C++, it stores information not only about global variables (as .symtab does) but about local ones too. .line maps code fragments to line numbers in the source code. We need this because the correspondence between lines of source code in higher-level languages and assembly instructions is not straightforward. This information allows one to debug a program in a higher-level language line by line. .strtab stores character strings; it is like an array of strings. Other sections, such as .symtab and .debug, do not store immediate strings but rather indices into .strtab. .symtab stores the symbol table. Whenever a programmer defines a label, NASM creates a symbol for it.3 This table also stores utility information, which we will study later.
Now that we have a general understanding of the linking view of an ELF file, let's look at some examples showing the particulars of the three different types of ELF files.

5.3.2 Relocatable Object Files Let's investigate an object file, obtained by compiling a simple program, shown in Listing 5-22.

3 Not to be confused with preprocessor symbols!



Listing 5-22. symbol.asm
section .data
datavar1: dq 1488
datavar2: dq 42

section .bss
bssvar1: resq 4*1024*1024
bssvar2: resq 1

section .text
extern somewhere
global _start
mov rax, datavar1
mov rax, bssvar1
mov rax, bssvar2
mov rdx, datavar2
_start:
jmp _start
ret
textlabel: dq 0
This program uses the extern and global directives to mark symbols in different ways. These two directives control the creation of the symbol table. By default, all symbols are local to the current module. extern defines a symbol that is defined in other modules but referenced in the current one. On the contrary, global defines a globally available symbol, which other modules can refer to by defining it as extern on their side.

■■Avoid confusion  Do not confuse global and local symbols with global and local labels! GNU binutils is a collection of binary tools used to work with object files. It includes several tools we can use to explore object file contents. Several of them are of special interest to us. • If you only need to look at the symbol table, use nm. • Use objdump as a universal tool to display general information about an object file. Besides ELF, it supports other object file formats. • If you know the file is in ELF format, readelf is usually the best and most informative choice. Let's feed this program to objdump to produce the results shown in Listing 5-23.
Listing 5-23. Symbols
> nasm -f elf64 main.asm && objdump -tf -m intel main.o
main.o:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000011: HAS_RELOC, HAS_SYMS
start address 0x0000000000000000



SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 main.asm
0000000000000000 l    d  .data  0000000000000000 .data
0000000000000000 l    d  .bss   0000000000000000 .bss
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 l       .data  0000000000000000 datavar1
0000000000000008 l       .data  0000000000000000 datavar2
0000000000000000 l       .bss   0000000000000000 bssvar1
0000000002000000 l       .bss   0000000000000000 bssvar2
0000000000000029 l       .text  0000000000000000 textlabel
0000000000000000         *UND*  0000000000000000 somewhere
0000000000000028 g       .text  0000000000000000 _start

We are presented with a symbol table, where each symbol is annotated with useful information. What do its columns mean? 1. The virtual address of the given symbol. For now, we do not know the starting addresses of the sections, so all virtual addresses are given relative to the section start. For example, datavar1 is the first variable stored in .data; its address is 0 and its size is 8 bytes. The second variable, datavar2, resides in the same section at an offset of 8, right after datavar1. Since somewhere is defined as extern, it is obviously located in some other module, so for now its address is meaningless and is left zero. 2. A string of seven letters and spaces; each letter characterizes the symbol in a certain way. Several of them are of interest to us. (a) l, g, or blank: local, global, or neither. (b) … (c) … (d) … (e) I or blank: a link to another symbol, or an ordinary symbol. (f) d, D, or blank: a debug symbol, a dynamic symbol, or an ordinary symbol. (g) F, f, O, or blank: a function name, a file name, an object name, or an ordinary symbol. 3. Which section does this symbol correspond to? *UND* stands for an unknown section (the symbol is referenced but not defined here); *ABS* means no section at all. 4. Usually this number shows the alignment (or its absence). 5. The symbol name. As an example, let's investigate the first symbol shown in Listing 5-23: f means it is a file name, d that it is needed for debugging purposes only, and l that it is local to this module. The global label _start (which is also an entry point) is marked with the letter g in the second column.

■■Note  Symbol names are case sensitive: _start and _START are different.



Since the addresses in the symbol table are not yet actual virtual addresses, but rather offsets relative to the section starts, we may ask ourselves: what do they look like in the machine code? NASM has already done its job, and the machine instructions have been produced. We can look inside the interesting sections of an object file by calling objdump with the -D flag (disassemble) and, optionally, -M intel-mnemonic (to display Intel-style syntax instead of AT&T). Listing 5-24 shows the results.

■■How to read disassembly dumps  The left column usually shows the absolute address where the data will be loaded. Before linking, it is an address relative to the section start. The second column shows the raw bytes as hexadecimal numbers. The third column can contain the mnemonics that result from disassembling those bytes.
Listing 5-24. objdump_d
> objdump -D -M intel-mnemonic main.o
main.o:     file format elf64-x86-64
Disassembly of section .data:
0000000000000000:
        ...
0000000000000008:
        ...
Disassembly of section .bss:
0000000000000000:
        ...
0000000002000000:
        ...
Disassembly of section .text:
0000000000000000:
   0:   48 b8 00 00 00 00 00 00 00 00   movabs rax,0x0
   a:   48 b8 00 00 00 00 00 00 00 00   movabs rax,0x0
  14:   48 b8 00 00 00 00 00 00 00 00   movabs rax,0x0
  1e:   48 ba 00 00 00 00 00 00 00 00   movabs rdx,0x0
0000000000000028:
  28:   c3                              ret
0000000000000029:
        ...
The operand of the mov instruction at offset 0 from the start of the .text section should be the address of datavar1, but it equals zero! The same happened to bssvar. It means the linker must change the compiled machine code, filling in the correct absolute addresses in the instruction arguments. To achieve this, for each symbol, all the places it is referred to are memorized in the relocation table. Once the linker understands what the symbol's actual virtual address will be, it traverses the list of the symbol's occurrences and fills in the gaps. There is a separate relocation table for every section that needs one. To view the relocation tables, use readelf --relocs. See Listing 5-25.



Listing 5-25. readelf_relocs
> readelf --relocs main.o
Relocation section '.rela.text' at offset 0x440 contains 4 entries:
  Offset        Info          Type         Sym. Value        Sym. Name + Addend
000000000002  000200000001  R_X86_64_64  0000000000000000  .data + 0
00000000000c  000300000001  R_X86_64_64  0000000000000000  .bss + 0
000000000016  000300000001  R_X86_64_64  0000000000000000  .bss + 2000000
000000000020  000200000001  R_X86_64_64  0000000000000000  .data + 8
An alternative way to display the symbol table is the nm utility, which is more lightweight and minimalistic. For each symbol, it shows its virtual address, type, and name. Note that the type flags have a different format compared to objdump. See Listing 5-26 for a minimal example.
Listing 5-26. nm
> nm main.o
0000000000000000 b bssvar
0000000000000000 d datavar
                 U somewhere
000000000000000a T _start
000000000000000b t textlabel

5.3.3 Executable Object Files The second type of object file can be executed right away. It retains its structure, but the addresses are now bound to exact values. Let's look at another example, shown in Listing 5-27. It includes two variables, somewhere and private, one of which is available to all modules (marked global). A symbol func is marked global as well.
Listing 5-27. executable_object.asm
global somewhere
global func

section .data
somewhere: dq 999
private: dq 666

section .text
func:
mov rax, somewhere
ret
Let's compile it as usual with nasm -f elf64 and then link it with ld together with the old object file, obtained by compiling the file shown in Listing 5-22. Listing 5-28 shows the changes in objdump's output.



Listing 5-28. objdump_tf
> nasm -f elf64 symbol.asm
> nasm -f elf64 executable_object.asm
> ld symbol.o executable_object.o -o main
> objdump -tf main

main:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112: EXEC_P, HAS_SYMS, D_PAGED
start address 0x0000000000000000

SYMBOL TABLE:
00000000004000b0 l    d  .text  0000000000000000 .text
00000000006000bc l    d  .data  0000000000000000 .data
0000000000000000 l    df *ABS*  0000000000000000 executable_object.asm
00000000006000c4 l       .data  0000000000000000 private
00000000006000bc g       .data  0000000000000000 somewhere
0000000000000000 g       *UND*  0000000000000000 _start
00000000006000cc g       .data  0000000000000000 __bss_start
00000000004000b0 g  F    .text  0000000000000000 func
00000000006000cc g       .data  0000000000000000 _edata
00000000006000d0 g       .data  0000000000000000 _end

The flags are different now: the file is executable (EXEC_P), and there are no relocation tables anymore (the HAS_RELOC flag is cleared). The virtual addresses are now final, as are the addresses encoded in the instructions. This file is ready to be loaded and executed. It still carries a symbol table; if you want to shrink it and make the executable smaller, use the strip utility.

■■Question 71  Why does ld issue a warning if _start is not marked global? Get the address of the entry point in this case using readelf with the appropriate arguments. ■■Question 72  Find out the ld option to automatically delete the symbol table after linking.

5.3.4 Dynamic Libraries Almost every program uses library code. Libraries come in two kinds: static and dynamic. Static libraries consist of multiple relocatable object files. They are linked into the main program and merged with the resulting executable file. In the Windows world, such files have a .lib extension. In the Unix world, these are .a files, archives that contain multiple .o files inside. Dynamic libraries are also known as shared object files, the third of the three object file types we defined above. They are linked with the program during its execution. In the Windows world, these are the famous .dll files. In the Unix world, such files have a .so (shared object) extension.



While static libraries are little more than archives of relocatable object files, dynamic libraries have several differences, which we will study now. Dynamic libraries are loaded when they are needed. Since they are object files of their own, they carry all kinds of meta information about the code they provide for external use. This information is used by the loader to determine the exact addresses of the exported data and functions. A dynamic library can be shipped separately and updated independently. That is both good and bad: while the library's maintainer can ship bug fixes, he can also break backward compatibility, for example, by changing function arguments, effectively planting a delayed-action mine. A program can use any number of shared libraries. Such libraries should be loadable at any address; otherwise, they would be stuck at fixed addresses, which would put us in exactly the same situation as trying to execute multiple programs in the same physical memory address space. There are two ways to achieve this: • We can perform a relocation in runtime, when the library is being loaded. However, that robs us of a very attractive feature: the possibility of reusing library code in physical memory without duplicating it when it is used by multiple processes. If each process performs the library relocation to a different address, the corresponding pages are patched with different address values and thus become distinct for different processes. The .data section would effectively be relocated anyway because of its mutable nature; giving up global variables allows us to throw away both the section and the need to relocate it. Another problem is that the .text section must be writable in order to be modified during relocation. This introduces certain security risks, making it possible for malicious code to modify it. Moreover, patching the .text of every shared object, when multiple libraries are needed to run an executable, can be time consuming. • We can write PIC (Position Independent Code).
It is possible to write code that runs correctly no matter where in memory it resides. For that, we need to get rid of absolute addresses completely. Modern processors support rip-relative addressing, such as mov rax, [rip + 13]. This feature facilitates PIC generation.

This technique allows .text sections to be shared. Today, programmers are strongly encouraged to use PIC instead of relocations.

■■Note  Whenever you use non-constant global variables, you prevent your code from being reentrant, that is, able to be executed by multiple threads concurrently without changes. Consequently, you will have a hard time reusing it in a shared library. This is one of many arguments against global mutable state in a program. Dynamic libraries save disk space and memory. Remember that pages can be either private or shared between multiple processes. If a library is used by many processes, most of its parts will not be duplicated in physical memory. We will now show how to build a minimal shared object. However, we will defer the discussion of such things as the global offset table and the procedure linkage table until Chapter 15. Listing 5-29 shows the minimal contents of a shared object. Note the extern symbol _GLOBAL_OFFSET_TABLE_ and the :function type specification for the global symbol func. Listing 5-30 shows a minimal main part, which calls a function from the shared object file and exits correctly.



Listing 5-29. libso.asm
extern _GLOBAL_OFFSET_TABLE_
global func:function

section .rodata
message: db "Shared object write this", 10, 0

section .text
func:
mov rax, 1
mov rdi, 1
mov rsi, message
mov rdx, 14
syscall
ret

Listing 5-30. libso_main.asm
global _start
extern func

section .text
_start:
mov rdi, 10
call func
mov rdi, rax
mov rax, 60
syscall
Listing 5-31 shows the compilation commands and two views of an ELF file. Note that the dynamic library has more specific sections, such as .dynsym. The .hash, .dynsym, and .dynstr sections are needed for relocation. .dynsym stores the symbols visible from outside the library. .hash is a hash table, needed to decrease the symbol lookup time in .dynsym. .dynstr stores strings, referenced by index from .dynsym.
Listing 5-31. libso
> nasm -f elf64 -o main.o main.asm
> nasm -f elf64 -o libso.o libso.asm
> ld -shared -o libso.so libso.o
> ld -o main main.o -d libso.so --dynamic-linker=/lib64/ld-linux-x86-64.so.2
> readelf -S libso.so
There are 13 section headers, starting at offset 0x5a0:


Section Headers:
  [Nr] Name       Type      Address           Offset
       Size              EntSize           Flags  Link  Info  Align
  [ 0]            NULL      0000000000000000  00000000
       0000000000000000  0000000000000000          0     0     0
  [ 1] .hash      HASH      00000000000000e8  000000e8
       000000000000002c  0000000000000004  A       2     0     8
  [ 2] .dynsym    DYNSYM    0000000000000118  00000118
       0000000000000090  0000000000000018  A       3     2     8
  [ 3] .dynstr    STRTAB    00000000000001a8  000001a8
       000000000000001e  0000000000000000  A       0     0     1
  [ 4] .rela.dyn  RELA      00000000000001c8  000001c8
       0000000000000018  0000000000000018  A       2     0     8
  [ 5] .text      PROGBITS  00000000000001e0  000001e0
       000000000000001c  0000000000000000  AX      0     0     16
  [ 6] .rodata    PROGBITS  00000000000001fc  000001fc
       000000000000001a  0000000000000000  A       0     0     4
  [ 7] .eh_frame  PROGBITS  0000000000000218  00000218
       0000000000000000  0000000000000000  A       0     0     8
  [ 8] .dynamic   DYNAMIC   0000000000200218  00000218
       00000000000000f0  0000000000000010  WA      3     0     8
  [ 9] .got.plt   PROGBITS  0000000000200308  00000308
       0000000000000018  0000000000000008  WA      0     0     8
  [10] .shstrtab  STRTAB    0000000000000000  00000320
       0000000000000065  0000000000000000          0     0     1
  [11] .symtab    SYMTAB    0000000000000000  00000388
       00000000000001c8  0000000000000018          12    15    8
  [12] .strtab    STRTAB    0000000000000000  00000550
       000000000000004f  0000000000000000          0     0     1
Key to Flags:
  G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

> readelf -S main
There are 14 section headers, starting at offset 0x650:

Section Headers:
  [Nr] Name       Type      Address           Offset
       Size              EntSize           Flags  Link  Info  Align
  [ 0]            NULL      0000000000000000  00000000
       0000000000000000  0000000000000000          0     0     0
  [ 1] .interp    PROGBITS  0000000000400158  00000158
       000000000000000f  0000000000000000  A       0     0     1
  [ 2] .hash      HASH      0000000000400168  00000168
       0000000000000028  0000000000000004  A       3     0     8
  [ 3] .dynsym    DYNSYM    0000000000400190  00000190
       0000000000000078  0000000000000018  A       4     1     8
  [ 4] .dynstr    STRTAB    0000000000400208  00000208
       0000000000000027  0000000000000000  A       0     0     1
  [ 5] .rela.plt  RELA      0000000000400230  00000230
       0000000000000018  0000000000000018
  [ 6] .plt       PROGBITS  0000000000400250  00000250
       0000000000000020  0000000000000010  AX      0     0     16
  [ 7] .text      PROGBITS  0000000000400270  00000270
       0000000000000014  0000000000000000  AX      0     0     16
  [ 8] .eh_frame  PROGBITS  0000000000400288  00000288
       0000000000000000  0000000000000000  A       0     0     8
  [ 9] .dynamic   DYNAMIC   0000000000600288  00000288
       0000000000000110  0000000000000010  WA      4     0     8
  [10] .got.plt   PROGBITS  0000000000600398  00000398
       0000000000000020  0000000000000008  WA      0     0     8
  [11] .shstrtab  STRTAB    0000000000000000  000003b8
       0000000000000065  0000000000000000          0     0     1
  [12] .symtab    SYMTAB    0000000000000000  00000420
       00000000000001e0  0000000000000018          13    15    8
  [13] .strtab    STRTAB    0000000000000000  00000600
       000000000000004d  0000000000000000          0     0     1

■■Question 73  Study the symbol tables of a shared object obtained using readelf --dyn-syms and objdump -ft.

■■Question 74  What is the meaning of the LD_LIBRARY_PATH environment variable?

■■Question 75  Split the first assignment into two modules. The first module will store all functions defined in lib.inc. The second will have the entry point and will call some of these functions.

■■Question 76  Take one of the standard Linux utilities (from coreutils). Study its object file structure using readelf and objdump.

What we have seen in this section applies to most situations. However, there is a bigger picture of different code models that affect addressing. We will dive into these details in Chapter 15 after becoming more familiar with assembly and C. There we will also review dynamic libraries again and introduce the notions of the Global Offset Table and the Procedure Linkage Table.

5.3.5 Loader
The loader is the part of the operating system that prepares an executable file for execution. This includes mapping its relevant sections into memory, initializing .bss, and sometimes mapping other files from disk. Listing 5-32 shows the program headers of the executable built from symbol.asm (shown in Listing 5-22).

Listing 5-32. pht_symbols
> nasm -f elf64 symbol.asm
> nasm -f elf64 executable_object.asm
> ld symbol.o executable_object.o -o main
> readelf -l main
Elf file type is EXEC (Executable file)
Entry point 0x4000d8
There are 2 program headers, starting at offset 64


Program Headers:
  Type   Offset             VirtAddr           PhysAddr
         FileSiz            MemSiz             Flags  Align
  LOAD   0x0000000000000000 0x0000000000400000 0x0000000000400000
         0x00000000000000e3 0x00000000000000e3 R E    200000
  LOAD   0x00000000000000e4 0x00000000006000e4 0x00000000006000e4
         0x0000000000000010 0x000000000200001c RW     200000

 Section to Segment mapping:
  Segment Sections...
   00     .text
   01     .data .bss

The table tells us that two segments are present.
1. Segment 00
• Is loaded at 0x400000, aligned at 0x200000.
• Contains the .text section.
• Can be read and executed. It is not writable (so you cannot overwrite the code).
2. Segment 01
• Is loaded at 0x6000e4, aligned at 0x200000.
• Can be read and written.
Alignment means that the actual load address will be the closest address divisible by 0x200000. Thanks to virtual memory, all programs can be loaded at the same starting address; usually it is 0x400000.

There are a few important notes to make:
• Assembly sections with the same names defined in different files are merged.
• A relocation table is not required in a pure executable file. Relocations remain, in part, only in shared objects.

Let's launch the resulting file and view its /proc/<pid>/maps file, as we did in Chapter 4. Listing 5-33 shows its sample contents. The executable is made to loop infinitely so that we can inspect it while it runs.

Listing 5-33. symbol_maps
00400000-00401000 r-xp 00000000 08:01 1176842    /home/sayon/rest
00600000-02601000 rwxp 00000000 00:00 0
7ffe19cf2000-7ffe19d13000 rwxp 00000000 00:00 0


7ffe19d40000-7ffe19d42000 r--p 00000000 00:00 0    [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0    [vsyscall]

As we can see, the program header table tells us the truth about where the segments are loaded.

■■Note  In some cases, you will find that the linker's behavior needs to be adjusted. Section load addresses and their relative placement can be controlled through linker scripts, which describe the resulting file. Such cases usually occur when you are programming an operating system or microcontroller firmware. This topic is beyond the scope of this book, but we recommend that you consult [4] if such a need arises.

5.4 Task: Dictionary
This task will take us even further toward a working Forth interpreter. Some things might seem far-fetched, like the macro design, but it will be a good basis for the interpreter we will write later.

Our task is to implement a dictionary. It provides a mapping between keys and values. Each entry contains the address of the next entry, a key, and a value. In our case, the keys and values are null-terminated strings.

The data structure formed by such dictionary entries is called a linked list. An empty list is represented by a null pointer, equal to zero. A non-empty list is a pointer to its first element. Each element holds some value and a pointer to the next element (or zero if it is the last element). Listing 5-34 shows an example of a linked list containing the elements 100, 200, and 300. It can be referenced by a pointer to its first element, i.e., x1.

Listing 5-34. linked_list_ex.asm
section .data
x1: dq x2
    dq 100
x2: dq x3
    dq 200
x3: dq 0
    dq 300

Linked lists are often useful in situations with multiple insertions and deletions in the middle of the list. However, accessing elements by index is costly, because it does not boil down to simple pointer arithmetic: the mutual positions of linked list elements in flat memory are generally not predictable.

In this task, the dictionary will be constructed statically as a list, with each newly defined element linked to the previous one. You should use macros with local labels and symbol redefinition to automate the creation of linked lists. We will create a macro named colon with two arguments: the first holds the dictionary key string, and the second holds the name used for the entry's internal representation. This distinction is necessary because key strings can contain characters that are not allowed in label names (spaces, punctuation, arithmetic signs, etc.). Listing 5-35 shows an example of such a dictionary.


Listing 5-35. linked_list_ex_macro.asm
section .data
colon "third word", third_word
db "third word explanation", 0
colon "second word", second_word
db "second word explanation", 0
colon "first word", first_word
db "first word explanation", 0

The task will contain the following files:
1. main.asm
2. lib.asm
3. dict.asm
4. colon.inc

Follow these steps to complete the task:
1. Create a separate assembly file containing the functions you already wrote for the first assignment. Let's call it lib.asm. Don't forget to mark all necessary labels as global; otherwise, they won't be visible outside this object file!
2. Create a colon.inc file and define a colon macro there to create dictionary words. This macro takes two arguments:
• The dictionary key (in quotes).
• The assembly label name. Keys can contain spaces and other characters that are not allowed in label names.
Each entry must start with a pointer to the next entry and then hold the key as a null-terminated string. The programmer then describes the contents directly, for example, using db directives, as in the example shown in Listing 5-35.
3. Create a find_word function inside a new dict.asm file. It takes two arguments:
(a) A pointer to a null-terminated key string.
(b) A pointer to the last word in the dictionary.
Having a pointer to the last word defined, we can follow the consecutive links to enumerate all words in the dictionary. find_word will go through the whole dictionary, comparing the given key with each key in the dictionary. If the entry is not found, it returns zero; otherwise, it returns the entry's address.
4. Create a separate include file, words.inc, to define dictionary words using the colon macro. Include it in main.asm.


5. Write a simple _start function. It must perform the following actions:
• Read an input string into a buffer of at most 255 characters.
• Try to find this key in the dictionary. If found, print the corresponding value. Otherwise, print an error message.
Don't forget: all error messages must be written to stderr rather than stdout!

We ship a set of stub files (see Section 2.1 "Environment Setup"); you are free to use them. An accompanying Makefile describes the build process; type make in the assignment directory to build the main executable file. A quick tutorial on the GNU Make system is available in Appendix B. As with the first assignment, there is a test.py file for automated testing.

5.5 Summary
In this chapter, we looked at the different compilation stages. We studied the NASM macro processor in detail and learned about conditionals and loops. Next, we talked about the three types of object files: relocatable, executable, and shared. We worked through the structure of ELF files and observed the relocation process performed by the linker. We touched on shared object files and will review them again in Chapter 15.

■■Question 77  What is a linked list?
■■Question 78  What are the compilation stages?
■■Question 79  What is preprocessing?
■■Question 80  What is a macro instantiation?
■■Question 81  What is the %define directive?
■■Question 82  What is the %macro directive?
■■Question 83  What is the difference between %define, %xdefine, and %assign?
■■Question 84  Why do we need the %% operator inside macros?
■■Question 85  What types of conditions does the NASM macro processor support? Which directives are used for them?
■■Question 86  What are the three types of ELF object files?
■■Question 87  What types of headers are present in an ELF file?
■■Question 88  What is relocation?
■■Question 89  Which sections can be present in ELF files?
■■Question 90  What is a symbol table? What kind of information does it store?
■■Question 91  Is there a connection between sections and segments?
■■Question 92  Is there a connection between assembly sections and ELF sections?


■■Question 93  Which symbol marks the program entry point?
■■Question 94  What are the two different types of libraries?
■■Question 95  Is there a difference between a static library and a relocatable object file?


CHAPTER 6

Interrupts and System Calls In this chapter we will discuss two topics. First, as the von Neumann architecture lacks interactivity, interrupts were introduced to change that. While we're not going to delve into the hardware part of interrupts, we're going to learn exactly how the programmer sees interrupts. Also, we'll talk about the input and output ports used to communicate with external devices. Second, the operating system (OS) usually provides an interface to interact with the resources it controls: memory, files, CPU (central processing unit), etc. This is implemented through the system call mechanism. Transferring control to operating system routines requires a well-defined privilege escalation mechanism, and we'll see how this works on the Intel 64 architecture.

6.1 Input and Output
When we extended the von Neumann architecture to work with external devices, we mentioned interrupts only as a means of communicating with them. In fact, there is a second feature, input/output (I/O) ports, which complements interrupts and allows data to be exchanged between the CPU and devices.

Applications can access I/O ports in two ways:
1. Through a separate I/O address space. There are 2^16 one-byte addressable I/O ports, from 0 to FFFFH. The in and out commands are used to exchange data between the ports and the eax register (or parts of it). Write and read permissions on the ports are controlled by checking:
• The IOPL (I/O privilege level) field of the rflags register
• The I/O permission bitmap of the Task State Segment
We will talk about these in Section 6.1.1.
2. Through memory-mapped I/O. A portion of the address space is specifically reserved for interaction with those external devices that respond like memory components. Consequently, any memory-addressing instruction (mov, movsb, etc.) can be used to perform I/O with these devices. The standard segmentation and paging protection mechanisms apply to such I/O operations.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_6


Chapter 6 ■ Interrupts and System Calls

The IOPL field in the rflags register works as follows: if the current privilege level is less than or equal to IOPL, the following instructions can be executed:
• in and out (normal input/output).
• ins and outs (string input/output).
• cli and sti (clear/set the interrupt flag).
Therefore, by setting IOPL individually for an application, we can forbid it to perform port I/O even if it works at a higher privilege level than ordinary user applications.
In addition, Intel 64 allows even finer permission control through the I/O permission bitmap. If the IOPL check passes, the processor checks the bit corresponding to the port used. The operation continues only if this bit is not set.
The I/O permission bitmap is part of the Task State Segment (TSS), which was designed to be unique for each task. However, because the hardware task-switching mechanism is considered obsolete, only one TSS (and one I/O permission bitmap) can exist in long mode.

6.1.1 The tr Register and the Task State Segment
Some protected mode artifacts are still used, in some form, in long mode. Segmentation is one example; it is now mainly used to implement protection rings. Another is the pair made of the tr control register and the Task State Segment.
The tr register contains the segment selector of the TSS descriptor. The latter resides in the GDT (Global Descriptor Table) and has a format similar to that of segment descriptors. As with segment registers, there is a shadow register, which is updated from the GDT when tr is updated via the ltr (load task register) instruction.
The TSS is a region of memory that was used to store information about a task in the presence of the hardware task-switching mechanism. Since no popular operating system used it in protected mode, this mechanism was removed from long mode. However, the TSS is still used in long mode, albeit with a completely different structure and purpose. Currently, there is only one TSS used by the operating system, with the structure depicted in Figure 6-1.


Figure 6-1. Long mode Task State Segment

The first 16 bits store an offset to the I/O port permission map, which we discussed in Section 6.1. The TSS then holds pointers to the seven special interrupt stack tables (ISTs) and stack pointers for the different rings. Whenever the privilege level changes, the stack is automatically changed accordingly. Normally, the new value of rsp will be taken from the TSS field corresponding to the new protection ring. The meaning of the ISTs is explained in Section 6.2.


6.2 Interrupts
Interrupts allow us to change the program's flow of control at an arbitrary moment in time. During program execution, external events (a device requires the CPU's attention) or internal events (division by zero, insufficient privilege level to execute an instruction, a non-canonical address) can cause an interrupt, resulting in the execution of some other code. This code is called an interrupt handler and is part of the operating system or of driver software.
In [15], Intel separates external asynchronous interrupts from internal synchronous exceptions, but both are handled in the same way. Each interrupt is labeled with a fixed number, which serves as its identifier. Exactly how the processor obtains the interrupt number is not important to us.
When the n-th interrupt occurs, the CPU consults the interrupt descriptor table (IDT), which resides in memory. Analogously to the GDT, its address and size are stored in idtr. Figure 6-2 depicts idtr.

Figure 6-2. The idtr register

Each entry in the IDT occupies 16 bytes, and the n-th entry corresponds to the n-th interrupt. An entry incorporates some useful information as well as the interrupt handler's address. Figure 6-3 depicts the interrupt descriptor format.

Figure 6-3. Interrupt descriptor

DPL (Descriptor Privilege Level): the current privilege level must be less than or equal to DPL to invoke this handler using the int instruction. Otherwise, this check is not performed.
The Type field is 1110 (Interrupt Gate: IF is automatically cleared in the handler) or 1111 (Trap Gate: IF is not cleared).
The first 30 interrupts are reserved. This means that you can provide interrupt handlers for them, but the CPU uses them for its internal events, such as invalid instruction encodings. A system programmer may use the other interrupts. When the IF flag is set, interrupts are handled; otherwise, they are ignored.


■■Question 96  What are non-maskable interrupts? What is their connection to the interrupt with code 2 and the IF flag?

Application code runs with low privileges (in ring 3). Direct device control is possible only at higher privilege levels, so when a device requires attention by sending an interrupt to the CPU, the handler must run in a more privileged ring, which requires changing the segment selector. What about the stack? The stack must also be switched. Here we have several options, based on how the IST field of the interrupt descriptor is set up.
• If IST is 0, the default mechanism is used. When an interrupt occurs, ss is loaded with 0 and the new rsp is loaded from the TSS. The RPL field of ss is then set to the appropriate privilege level. Then the old ss and rsp are saved on this new stack.
• If an IST is specified, one of the seven ISTs defined in the TSS is used. The reason ISTs exist is that some serious faults (non-maskable interrupts, double faults, etc.) benefit from being handled on a known good stack. So a system programmer can create multiple stacks, even for ring 0, and use some of them to handle specific interrupts.
There is a special int instruction, which accepts an interrupt number. It manually invokes an interrupt handler based on the contents of its descriptor. It ignores the IF flag: whether the flag is set or cleared, the handler will be invoked. The DPL field exists to control invocations of privileged code through the int instruction.
Before an interrupt handler starts executing, some registers are automatically saved onto the stack. These are ss, rsp, rflags, cs, and rip. See the stack diagram in Figure 6-4. Notice how the segment selectors are zero-padded to 64 bits.

Figure 6-4. Stack when an interrupt handler starts


Sometimes an interrupt handler needs additional information about the event. In that case, an interrupt error code is pushed onto the stack; it contains information specific to this type of interrupt.
Many interrupts are described using special mnemonics in Intel's documentation. For example, interrupt number 13 is known as #GP (general protection).1 You will find a brief description of some interesting interrupts in Table 6-1.

Table 6-1. Some important interrupts

Vector  Mnemonic  Description
0       #DE       Divide error
2                 Non-maskable external interrupt
3       #BP       Breakpoint
6       #UD       Invalid instruction opcode
8       #DF       Double fault (a fault occurred while handling an interrupt)
13      #GP       General protection
14      #PF       Page fault

Not all binary code corresponds to correctly encoded machine instructions. When rip does not address a valid instruction, the CPU generates the #UD interrupt.
The #GP interrupt is very common. It is generated when you try to dereference a forbidden address (one that does not correspond to any allocated page), when you try to perform an action that requires a higher privilege level, and so on.
The #PF interrupt is generated when a page is addressed whose present bit is cleared in the corresponding page table entry. This interrupt is used to implement swapping: the interrupt handler can load the missing page from disk.
Debuggers rely heavily on the #BP interrupt. When TF is set in rflags, a trap interrupt is generated after the execution of each instruction, allowing the program to be executed step by step. Such interrupts are, of course, handled by the operating system, so it is the operating system's responsibility to provide an interface that allows programmers to write their own debuggers.
In summary, when the n-th interrupt occurs, the following actions are performed from the programmer's point of view:
1. The IDT address is obtained from idtr.
2. The interrupt descriptor is found starting at byte 16 × n of the IDT.
3. The segment selector and the handler address are loaded from the IDT entry into cs and rip, possibly changing the privilege level. The old ss, rsp, rflags, cs, and rip are stored on the stack, as shown in Figure 6-4.
4. For some interrupts, an error code is pushed on top of the handler's stack. It provides additional information about the cause of the interrupt.
5. If the descriptor's Type field defines it as an Interrupt Gate, the interrupt flag IF is cleared. A Trap Gate, however, does not clear it automatically, which allows the handling of nested traps.

1 See section 6.3.1 of the third volume of [15].


If the interrupt flag were not cleared immediately after the interrupt handler starts, we would have no guarantee of executing even its first instruction before another interrupt popped up asynchronously and required our attention.

■■Question 97  Is the TF flag cleared automatically when entering interrupt handlers? See [15].

The interrupt handler terminates with the iretq instruction, which restores all the registers saved on the stack, as shown in Figure 6-4, unlike the ret instruction, which restores only rip.

6.3 System Calls
System calls are, as you know, functions that the operating system provides to user applications. This section describes the mechanism that allows them to be executed securely at a higher privilege level.
The mechanisms used to implement system calls vary across architectures. In general, any instruction that results in an interrupt will do: for example, a division by zero or any incorrectly encoded instruction. The interrupt handler will be called, and the CPU will take care of the rest.
In protected mode on Intel architecture, *nix operating systems used interrupt 0x80. Every time a user executed int 0x80, the interrupt handler checked the contents of the registers for the system call number and its arguments.
System calls happen quite frequently, and you cannot interact with the outside world without them. However, interrupts can be slow, especially on Intel 64, as they require IDT memory accesses. So Intel 64 introduced a new mechanism for performing system calls, built on the syscall and sysret instructions. Compared to interrupts, this mechanism has some important differences:
• The transition can occur only between ring 0 and ring 3. As rings 1 and 2 are rarely used, this limitation is not considered important.
• Interrupt handlers differ, but all system calls are handled by the same code with a single entry point.
• Some general-purpose registers are used implicitly during the system call:
–– rcx is used to store the old rip
–– r11 is used to store the old rflags

6.3.1 Model-Specific Registers
Sometimes, when a new CPU appears, it has additional registers that the old ones lacked. Often these are so-called model-specific registers (MSRs). When such registers are rarely modified, they are manipulated through two instructions: rdmsr to read them and wrmsr to write them. Both instructions operate on an MSR's identifying number. rdmsr accepts the MSR number in ecx and returns the register's value in edx:eax. wrmsr accepts the MSR number in ecx and stores the value taken from edx:eax into it.

6.3.2 syscall and sysret The syscall instruction depends on several MSRs.


• STAR (MSR number 0xC0000081), which contains two pairs of cs and ss values: for the system call handler and for the sysret instruction. Figure 6-5 shows its structure.

Figure 6-5. MSR STAR

• LSTAR (MSR number 0xC0000082) contains the address of the system call handler (the new rip).
• SFMASK (MSR number 0xC0000084) shows which bits of rflags should be cleared in the system call handler.
The syscall instruction performs the following actions:
• Loads cs and ss from STAR.
• Masks rflags according to SFMASK.
• Saves the old rip in rcx.
• Initializes rip with the LSTAR value.
Note that we can now explain why system calls and procedures accept their arguments in slightly different sets of registers. Procedures accept their fourth argument in rcx which, as we now know, is used by syscall to store the old rip value; system calls therefore take that argument in r10 instead. Unlike with interrupts, even if the privilege level changes, the handler itself must switch the stack pointer.
Handling a system call ends with the sysret instruction, which loads cs and ss from STAR and restores rip from rcx.
As we know, changing a segment selector leads to a GDT read to update the paired shadow register. However, when you execute syscall, the shadow registers are loaded with fixed values, and no GDT read is performed. Here are those two fixed values in decoded form:
• Code segment shadow register:
–– Base = 0
–– Limit = FFFFFH
–– Type = 11 (executable, accessed)
–– S = 1
–– DPL = 0
–– P = 1
–– L = 1 (long mode)
–– D = 0
–– G = 1 (always the case in long mode)


In addition, CPL (Current Privilege Level) is set to 0.
• Stack segment shadow register:
–– Base = 0
–– Limit = FFFFFH
–– Type = 3 (read/write, accessed)
–– S = 1
–– DPL = 0
–– P = 1
–– L = 1 (long mode)
–– D = 1
–– G = 1
However, the system programmer is responsible for meeting one requirement: the GDT must contain descriptors corresponding to these fixed values. Therefore, the GDT must store two specific descriptors, for code and data, specifically for syscall compatibility.

6.4 Summary
In this chapter, we provided an overview of interrupts and the system call mechanism. We studied their implementation down to the system data structures residing in memory. In the next chapter, we will review different models of computation, including Forth-like stack machines and finite automata, and finally we will work on a Forth interpreter and compiler in assembly language.

■■Question 98  What is an interrupt?
■■Question 99  What is the IDT?
■■Question 100  What does setting IF change?
■■Question 101  When does the #GP error occur?
■■Question 102  In which situations does the #PF error occur?
■■Question 103  How is the #PF error related to swapping? How does the operating system use it?
■■Question 104  Can we implement system calls using interrupts?
■■Question 105  Why do we need a separate instruction to implement system calls?
■■Question 106  Why does the interrupt handler need a DPL field?
■■Question 107  What is the purpose of interrupt stack tables?


■■Question 108  Does a single-threaded application have only one stack?
■■Question 109  Which input/output mechanisms does Intel 64 provide?
■■Question 110  What is a model-specific register?
■■Question 111  What are shadow registers?
■■Question 112  How are model-specific registers used in the system call mechanism?
■■Question 113  Which registers does the syscall instruction use?


CHAPTER 7

Computation Models
In this chapter we will study two models of computation: finite state machines and stack machines. A model of computation is akin to the language you use to describe your solution to a problem. Typically, a problem that is really difficult to solve correctly in one model of computation can be almost trivial in another. That is why programmers who know many different models of computation can be more productive: they solve problems in the most suitable model and then implement the solution with the tools at their disposal.
When trying to learn a new model of computation, do not think about it from the "old" point of view, such as trying to think about finite state machines in terms of variables and assignments. Try to start from scratch and logically build up the new system of notions.
We already know a lot about Intel 64 and its model of computation, derived from the von Neumann one. This chapter will introduce finite state machines (used to implement regular expressions) and stack machines similar to the Forth machine.

7.1 Finite State Machines
7.1.1 Definition
A deterministic finite state machine (deterministic finite automaton) is an abstract machine that acts on an input string, following some rules. We will use the terms "finite automaton" and "state machine" interchangeably. To define a finite automaton, the following parts must be provided:
1. A set of states.
2. An alphabet: a set of symbols that can appear in the input string.
3. A selected initial state.
4. One or more selected final states.
5. Transition rules between states. Each rule consumes one symbol of the input string. Its action can be described as follows: "if the automaton is in state S and the input symbol C arrives, the next current state will be Z."
If the current state has no rule for the current input symbol, we consider the automaton's behavior undefined. Undefined behavior is a concept more familiar to mathematicians than to engineers. For the sake of brevity, we describe only the "good" cases. The "bad" cases are of no interest to us, so we do not define the machine's behavior for them. However, when implementing such machines, we will treat all undefined cases as failures leading to a special error state.


Chapter 7 ■ Computational Models

Why bother with automata? Some tasks are particularly easy to solve when this paradigm of thinking is applied. Such tasks include scanning input and searching for substrings that match a given pattern.
Suppose, for example, that we are checking whether a string can be interpreted as an integer. Let's draw a diagram, shown in Figure 7-1. It defines the states and shows the possible transitions between them.
• The alphabet consists of letters, spaces, digits, and punctuation marks.
• The set of states is {A, B, C}.
• The initial state is A.
• The final state is C.

Figure 7-1. Number recognition We start execution from state A. Each input symbol causes us to change the current state based on available transitions.

■■Note  Arrows marked with symbol ranges such as 0..9 actually denote multiple rules, each describing a transition for a single input character. Table 7-1 shows what happens when this machine is run on the input string +34. This is called an execution trace. Table 7-1. Trace of the finite state machine shown in Figure 7-1; the input is +34

Current state    Symbol    New state
A                +         B
B                3         C
C                4         C

The machine has reached the final state C. However, given an input idkfa, we could not have reached any state, because there are no rules that react to such input symbols. This is where the automaton's behavior is undefined. To make the automaton total, so that it always arrives at a "yes" or "no" answer, we have to add one more final state and add rules to all existing states. These rules should direct execution to the new state whenever no other rule matches the input symbol.
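The automaton just described can be transcribed into C almost mechanically. The following sketch is ours, not the book's code; it assumes, as Listing 7-1 later confirms, that state A also accepts a digit directly (for numbers without a sign), and it makes the machine total by means of an explicit error state.

```c
#include <ctype.h>

/* States of the automaton from Figure 7-1, extended with an
 * error state to make the machine total. Names are ours. */
enum state { ST_A, ST_B, ST_C, ST_ERR };

/* Returns 1 if the null-terminated string is an integer
 * such as "+34", "-7", or "128"; 0 otherwise. */
int recognize_number(const char *s) {
    enum state st = ST_A;
    for (; *s; s++) {
        switch (st) {
        case ST_A:
            if (*s == '+' || *s == '-') st = ST_B;
            else if (isdigit((unsigned char)*s)) st = ST_C;
            else st = ST_ERR;
            break;
        case ST_B:
        case ST_C:
            if (isdigit((unsigned char)*s)) st = ST_C;
            else st = ST_ERR;
            break;
        case ST_ERR:
            return 0;
        }
    }
    return st == ST_C; /* C is the only accepting state */
}
```

Note how the error state absorbs every symbol: once entered, it can never be left, which is exactly the "total automaton" construction described above.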

7.1.2 Example: bit parity We receive a string of 0s and 1s. We want to know whether it contains an even or an odd number of ones. Figure 7-2 shows a finite state machine that solves this problem.

Figure 7-2. Is the number of ones in the input string even? The empty string has zero ones, and zero is an even number. Because of this, state A is both the initial state and a final state. All zeros are ignored regardless of the state. However, each one that occurs in the input flips the state to the opposite one. If, given an input string, we end up in the final state A, then the number of ones is even. If we end up in the final state B, it is odd.
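A minimal C sketch of this two-state parity automaton (the names are ours):

```c
/* The two states of the automaton in Figure 7-2. */
enum parity_state { EVEN, ODD };

/* Returns 1 if the string of '0'/'1' characters contains
 * an even number of ones, 0 otherwise. */
int ones_count_is_even(const char *s) {
    enum parity_state st = EVEN;   /* empty string: zero ones */
    for (; *s; s++)
        if (*s == '1')             /* each '1' flips the state */
            st = (st == EVEN) ? ODD : EVEN;
    return st == EVEN;             /* A corresponds to EVEN */
}
```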

■■Note  In finite state machines, there is no memory, no assignments, and no if-then-else constructs. It is thus a completely different abstract machine from the von Neumann one. There is really nothing more to it than states and transitions between them. In the von Neumann model, the state is the state of memory and register values.

7.1.3 Assembly Language Implementation Once you have designed a finite state machine to solve a specific problem, it is trivial to implement it in an imperative programming language such as assembly or C. The following is a simple way to implement such machines in assembly:

1. Make the designed automaton total: each state must have transition rules for every possible input symbol. If this is not the case, add a separate state corresponding to an error or a "no" answer to the problem being solved. For simplicity, we will call the rule leading there the else rule.
2. Implement a routine to get an input symbol. Note that a symbol is not necessarily a character: it can be a network packet, a user action, or another kind of global event.
3. For each state we must
• Create a label.
• Call the input-reading routine.
• Match the input symbol against those described in the transition rules and jump to the corresponding state labels on a match.
• Handle all other symbols with the else rule.

To implement the exemplary automaton in assembly, we first make it total, as shown in Figure 7-3.

Figure 7-3. Checking whether the string is a number: a total automaton

We will slightly modify this automaton to account for the input string being null-terminated, as shown in Figure 7-4. Listing 7-1 shows an example implementation.

Figure 7-4. Checking whether the string is a number: a total automaton for a null-terminated string

Listing 7-1. automaton_example_bits.asm
section .text
; getsymbol is a routine to read
; a symbol (e.g., from stdin) into al
_A:
    call getsymbol
    cmp al, '+'
    je _B
    cmp al, '-'
    je _B
; The digit characters occupy a contiguous range
; in the ASCII table, from '0' = 0x30 to '9' = 0x39.
; The following logic implements the transitions
; to the labels _E and _C
    cmp al, '0'
    jb _E
    cmp al, '9'
    ja _E
    jmp _C
_B:
    call getsymbol
    cmp al, '0'
    jb _E
    cmp al, '9'
    ja _E
    jmp _C
_C:
    call getsymbol
    test al, al
    jz _D
    cmp al, '0'
    jb _E
    cmp al, '9'
    ja _E
    jmp _C
_D:
    ; code to report success
_E:
    ; code to report failure

When the automaton reaches state D or E, control passes to the instructions at the _D or _E label. This code can be isolated in a function that returns 1 (true) in the _D state and 0 (false) in the _E state.

7.1.4 Practical Value First of all, there is an important limitation: not every program can be encoded as a finite state machine. This model of computation is not Turing complete; for example, it cannot parse recursively structured text such as XML. C and assembly are Turing complete, which means they are more expressive and can be used to solve a wider range of problems. As another example, if the length of the input string is not limited, a finite state machine cannot count the words it contains. Each possible count would have to be a separate state, and there are only finitely many states in a finite state machine, while the word count can be arbitrarily large, just like the strings themselves.

■■Question 114 Design a finite state machine to count the words in the input string. The input length is not more than eight symbols. Finite state machines are often used to describe embedded systems such as coffee machines. The alphabet consists of events (buttons pressed); input is a sequence of user actions.

Network protocols can also be described as finite state machines. Each rule can be annotated with an optional output action: "if a symbol X is read, change the state to Y and emit a symbol Z." The input consists of received packets and global events such as timeouts; the output is a sequence of sent packets. There are also verification techniques, such as model checking, that allow testing certain properties of finite automata, for example, "if the automaton has reached state B, it will never reach state C." Such checks can be of great value when the systems being built must be highly reliable.

■■Question 115 Design a finite state machine to test whether the input string contains an odd or an even number of words. ■■Question 116 Design and implement a finite state machine to answer whether a string should be trimmed on the left, on the right, on both sides, or not trimmed at all. A string should be trimmed if it begins or ends with consecutive spaces.

7.1.5 Regular expressions Regular expressions are a way to encode finite automata. They are often used to define textual patterns for matching: you can search for occurrences of a specific pattern or replace them. Your favorite text editor probably already implements them. There are several dialects of regular expressions. As an example, let's take a dialect similar to the one used in the egrep utility. A regular expression R can be

1. A letter.
2. A concatenation of two regular expressions: RQ.
3. The metasymbols ˆ and $, which match the start and the end of the line.
4. A pair of grouping parentheses with a regular expression inside: (R).
5. An OR expression: R | Q.
6. R*, denoting zero or more occurrences of R.
7. R+, denoting one or more occurrences of R.
8. R?, denoting zero or one occurrence of R.
9. A dot, matching any character.
10. Brackets, denoting a range of symbols; for example, [0-9] is equivalent to (0|1|2|3|4|5|6|7|8|9).

You can test regular expressions using the egrep utility. It reads its input and outputs only those lines that match the given pattern. To prevent the shell from processing the pattern, enclose it in single quotes like this: egrep 'expression'. The following are some examples of simple regular expressions:

• hello .+ matches hello Frank or hello 12; it does not match hello.
• [0-9]+ matches an unsigned integer, possibly starting with zeros.
• -?[0-9]+ matches a possibly negative integer, possibly starting with zeros.
• 0|(-?[1-9][0-9]*) matches any integer that does not start with zero (unless it is zero itself).

These rules allow us to define complex search patterns. The regular expression engine tries to match the pattern starting at each position in the text. Regular expression engines generally follow one of two approaches:

• Use a naive backtracking approach, trying all described sequences of symbols. For example, matching the string ab against the regular expression aa?a?b might produce the following sequence of events:
1. Try to match against aaab: failure.
2. Try to match against aab: failure.
3. Try to match against ab: success.
Thus, we try different branches of the decision tree until we find a successful one or until we see that all options lead to failure. This approach is usually quite fast and also simple to implement. However, in the worst case its complexity grows exponentially. Imagine matching the string aaa...a (a repeated n times) against the regular expression a?a?a?...a?aaa...a (a? repeated n times, then a repeated n times). The string will certainly match the regular expression. However, using the naive approach, the engine will have to go through all possible strings that could match this regular expression. For each a? it considers two options: the one where a is present and the one where it is not. There are 2^n such combinations, as many as there are subsets of a set of n elements. A regular expression that needs no more symbols than fit on this line of text can thus keep a modern computer busy for days or even years: already for n = 50, the number of options reaches 2^50 = 1125899906842624. Such regular expressions are called "pathological" because, due to the nature of the matching algorithm, they are processed very slowly.

• Construct a finite state machine from the regular expression. It is usually an NFA (Nondeterministic Finite Automaton). Unlike a DFA (Deterministic Finite Automaton), an NFA can have multiple rules for the same state and input symbol.
When such a situation occurs, the automaton performs both transitions and is in several states at once. In other words, there is not a single current state but a set of states the automaton finds itself in. This approach is a bit slower on average, but it has no pathological worst case with exponential running time. Standard Unix utilities such as grep use this approach. How do we build an NFA from a regular expression? The rules are quite simple:
–– A character corresponds to an automaton that accepts a string consisting of that one character, as shown in Figure 7-5.
–– We can extend the alphabet with additional symbols placed at the beginning and the end of each line.

Figure 7-5. NFA for a character
–– Thus, we handle ˆ and $ like any other symbols.
–– Grouping parentheses apply rules to groups of symbols. They are used only for correct parsing of the regular expression; in other words, they provide the structural information necessary to construct the automaton correctly.
–– OR corresponds to combining two NFAs by merging their initial states. Figure 7-6 illustrates the idea.

Figure 7-6. Combining NFAs via OR

–– An asterisk is implemented with a transition of the automaton to itself and a special ε rule: an ε transition can be taken without consuming an input symbol. Figure 7-7 shows the automaton for the expression a*b.

Figure 7-7. NFA: implementing the asterisk
–– R? is implemented similarly to R*. R+ is encoded as RR*.
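The set-of-states simulation can be sketched in C for the pathological pattern discussed above. This is our own illustration, not the book's code: the set of active NFA states is kept in a bitmask, so the running time stays linear in the input length instead of exponential.

```c
#include <stdint.h>

/* ε-closure: from state i < n (an optional a? item) we may skip
 * to state i+1 without consuming input. One ascending pass
 * propagates the whole chain of skips. */
static uint64_t closure(uint64_t states, int n) {
    int i;
    for (i = 0; i < n; i++)
        if (states & ((uint64_t)1 << i))
            states |= (uint64_t)1 << (i + 1);
    return states;
}

/* Simulates the NFA for the pattern a?...a? a...a (n optional a's,
 * then n mandatory a's). State i means "i pattern items matched";
 * the automaton accepts in state 2n. Requires 2n+1 <= 64 states. */
int match_pathological(int n, const char *s) {
    uint64_t states = closure(1, n);   /* start in state 0 */
    for (; *s; s++) {
        uint64_t next = 0;
        int i;
        if (*s != 'a') return 0;       /* the alphabet here is just 'a' */
        for (i = 0; i < 2 * n; i++)    /* consume one 'a' in every */
            if (states & ((uint64_t)1 << i))   /* active state */
                next |= (uint64_t)1 << (i + 1);
        states = closure(next, n);
    }
    return (int)((states >> (2 * n)) & 1);
}
```

Each input character is processed in O(n) bit operations regardless of how many of the 2^n backtracking branches would exist, which is exactly why grep-style engines survive pathological patterns.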

■■Question 117 Using any language you know, implement an analogue of grep based on NFA construction. You can refer to [11] for additional information. ■■Question 118 Study this regular expression: ˆ1?$|ˆ(11+?)\1+$. What might its purpose be? Imagine that the input is a string consisting only of the character 1. How does the result of matching this regular expression relate to the length of the string?

7.2 Forth Machine Forth is a language created by Charles Moore in 1971 for the 11-meter radio telescope operated by the National Radio Astronomy Observatory (NRAO) at Kitt Peak, Arizona. The system ran on two early minicomputers joined by a serial link. Being both a multiprogrammed and a multiprocessor system (in the sense that both computers shared responsibility for controlling the telescope and its scientific instruments), it controlled the telescope, collected data, and supported an interactive graphics terminal on which the operator interacted with the telescope and analyzed recorded data. Today, Forth remains a unique and interesting language, fun to learn and great for changing one's perspective. It is still in use, mainly in embedded software, thanks to its incredible level of interactivity. Forth can also be quite efficient. A Forth interpreter can be found in places such as

• The FreeBSD boot loader.
• Robot firmware.
• Embedded software (printers).
• Spacecraft software.

Therefore, it can safely be called a systems programming language. It is not difficult to implement a Forth interpreter and compiler for Intel 64 in assembly language; the rest of this chapter explains the details. There are almost as many Forth dialects as there are Forth programmers; we will use our own simple dialect.

7.2.1 Architecture Let's start by studying an abstract Forth machine. It consists of a processor, two separate stacks for data and return addresses, and linear memory, as shown in Figure 7-8.

Figure 7-8. Forth machine: the architecture

The stacks do not necessarily have to be part of the same memory address space. The Forth machine has a parameter called cell size; it is usually equal to the machine word size of the target architecture. In our case, the cell size is 8 bytes. The stacks consist of elements of this size. Programs consist of words separated by spaces or line breaks. Words are executed consecutively. Integer words denote pushes onto the data stack. For example, to push the numbers 42, 13, and 9 onto the data stack, just write 42 13 9. There are three kinds of words:

1. Integer words, described above.
2. Native words, written in assembly language.
3. Colon words, written in Forth as sequences of other Forth words.

The return stack is needed to be able to return from colon words, as we will see later. Most words manipulate the data stack. From now on, when we speak about the stack in Forth, we implicitly mean the data stack unless otherwise specified. Words take their arguments from the stack and push their results there. All instructions that operate on the stack consume their operands. For example, the words +, -, *, and / consume two operands from the stack, perform an arithmetic operation, and push the result back onto the stack. The program 1 4 8 8 + * + computes the expression (8 + 8) * 4 + 1. We follow the convention that the second operand is popped off the stack first. This means that the program 1 2 - evaluates to -1, not 1. The word : is used to define new words. It is followed by the name of the new word and a list of other words terminated by the word ;. Both colon and semicolon are separate words and thus must be delimited by spaces. A word sq, which takes an argument from the stack and pushes back its square, looks like this: : sq dup * ; Every time we use sq in a program, two words will be executed: dup (duplicate the cell on top of the stack) and * (multiply the two cells on top of the stack). To describe the effect of a Forth word, it is common to use stack diagrams: swap (a b -- b a)

In parentheses you see the state of the stack before and after the word is executed. The stack cells are given names to highlight the changes in the stack contents. So, the word swap exchanges the two items on top of the stack. The top element is written on the right, so the diagram 1 2 corresponds to pushing first 1, then 2. The word rot brings the third number from the top of the stack to the top: rot (a b c -- b c a)
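The stack discipline just described is easy to model in C. The following toy evaluator is our own sketch (not part of the Forth machine we will build): it supports only integer words and + - * /, and it follows the convention that the second operand is popped first, so "1 2 -" yields -1.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define STACK_CAP 256

/* A toy Forth-style data stack. Names are ours. */
static long stack[STACK_CAP];
static int sp; /* number of cells on the stack */

static void push(long x) { stack[sp++] = x; }
static long pop(void)    { return stack[--sp]; }

/* Evaluates a space-separated program of unsigned integers and
 * the words + - * /; returns the top of the stack. */
long eval(const char *program) {
    char buf[256];
    char *word;
    strncpy(buf, program, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    sp = 0;
    for (word = strtok(buf, " "); word; word = strtok(NULL, " ")) {
        if (isdigit((unsigned char)word[0]))
            push(strtol(word, NULL, 10));   /* an integer word */
        else {
            long b = pop(), a = pop();      /* second operand first */
            switch (word[0]) {
            case '+': push(a + b); break;
            case '-': push(a - b); break;
            case '*': push(a * b); break;
            case '/': push(a / b); break;
            }
        }
    }
    return pop();
}
```

Running eval on the program from the text, "1 4 8 8 + * +", reproduces (8 + 8) * 4 + 1.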

7.2.2 Tracing an Example Program Listing 7-2 shows a simple program that computes the discriminant of the quadratic equation 1x² + 2x + 3 = 0.

Listing 7-2. forth_discr
: sq dup * ;
: discr rot 4 * * swap sq swap - ;
1 2 3 discr

Now let's execute discr a b c step by step for some numbers a, b, and c. The state of the stack at the end of each step is shown on the right.

a      ( a )
b      ( a b )
c      ( a b c )

Then the word discr is executed; we step into it.

rot    ( b c a )
4      ( b c a 4 )
*      ( b c (a*4) )
*      ( b (c*a*4) )
swap   ( (c*a*4) b )
sq     ( (c*a*4) (b*b) )
swap   ( (b*b) (c*a*4) )
-      ( (b*b - c*a*4) )

Now let's do the same from the beginning, but for a = 1, b = 2, and c = 3.

1      ( 1 )
2      ( 1 2 )
3      ( 1 2 3 )
rot    ( 2 3 1 )
4      ( 2 3 1 4 )
*      ( 2 3 4 )
*      ( 2 12 )
swap   ( 12 2 )
sq     ( 12 4 )
swap   ( 4 12 )
-      ( -8 )

7.2.3 Dictionary The dictionary is the part of the Forth machine that stores word definitions. Each word starts with a header, followed by the sequence of words making up its body. The header stores a link to the previous word (as in a linked list), the name of the word itself as a null-terminated string, and some flags. We have already studied a similar data structure in the task described in Section 5.4; you can reuse much of that code to make defining new Forth words easier. See Figure 7-9 for the word header generated for the word discr described in Section 7.2.2.

Figure 7-9. Word header for discr
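The dictionary's linked-list structure can be modeled in C. This is an illustrative sketch with our own names; the real dictionary in this chapter lays the fields out in raw memory rather than in a C struct.

```c
#include <string.h>
#include <stddef.h>

/* A C model of the dictionary entry described above: a link to
 * the previous word, the name, and a flags byte. */
struct word_header {
    struct word_header *prev; /* link to the previously defined word */
    const char *name;         /* null-terminated word name */
    unsigned char flags;
};

/* Searches the dictionary from the most recent word backward, as
 * Forth does, so redefinitions shadow older words. Returns NULL
 * if no word with this name exists. */
struct word_header *find_word(struct word_header *last,
                              const char *name) {
    struct word_header *w;
    for (w = last; w; w = w->prev)
        if (strcmp(w->name, name) == 0)
            return w;
    return NULL;
}

/* Builds a tiny three-word dictionary and checks lookups;
 * returns 1 if all lookups behave as expected. */
int dictionary_demo(void) {
    static struct word_header plus_w  = { NULL,    "+",     0 };
    static struct word_header dup_w   = { &plus_w, "dup",   0 };
    static struct word_header discr_w = { &dup_w,  "discr", 0 };
    return find_word(&discr_w, "dup") == &dup_w
        && find_word(&discr_w, "+") == &plus_w
        && find_word(&discr_w, "nosuchword") == NULL;
}
```

The same backward traversal is what the assembly routine find_word in Section 7.3.1 performs over the raw headers.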

7.2.4 How Words Are Implemented There are three ways to implement words:

• Indirect threaded code
• Direct threaded code
• Subroutine threaded code

We are using the classic form: indirect threaded code. This kind of code needs two special cells (which we can call Forth registers):

PC points to the next Forth command. We will soon see that a Forth command is the address of a cell holding the address of the assembly code implementing the respective word. In other words, it is a pointer to executable assembly code with two levels of indirection.

W is used by non-native words. When such a word starts executing, this register points to its first word.

These two registers can be mapped to real CPU registers, or their contents can be stored in memory. Figure 7-10 shows how words are laid out when the indirect threading technique is used. It shows two words: a native word dup and a colon word square.

Figure 7-10. Indirect threaded code

Each word stores the address of its native implementation (assembly code) immediately after the header. For colon words, the implementation is always the same: docol. The implementation is invoked by a jmp instruction. The execution token is the address of the cell that points to the implementation. So an execution token is the address of an address of the word's implementation. In other words, given the address A of a word's entry in the dictionary, you can get its execution token simply by adding the total header length to A. Listing 7-3 provides a sample dictionary. It contains two native words (starting at w_plus and w_dup) and one colon word (w_double).

Listing 7-3. forth_dict_sample.asm
section .data
w_plus:
    dq 0            ; For the first word, the pointer
                    ; to the previous word is zero
    db '+', 0
    db 0            ; No flags
xt_plus:            ; Execution token for `plus`, equal to
                    ; the address of its implementation
    dq plus_impl
w_dup:
    dq w_plus
    db 'dup', 0
    db 0
xt_dup:
    dq dup_impl
w_double:
    dq w_dup
    db 'double', 0
    db 0
    dq docol        ; The `docol` address -- one level of indirection
    dq xt_dup       ; The body of `double` starts here

    dq xt_plus
    dq xt_exit
last_word:
    dq w_double

section .text
plus_impl:
    pop rax
    add rax, [rsp]
    mov [rsp], rax
    jmp next
dup_impl:
    push qword [rsp]
    jmp next

The core of the Forth engine is the inner interpreter. It is a simple assembly routine that fetches code from memory. It is shown in Listing 7-4.

Listing 7-4. forth_next.asm
next:
    mov w, pc
    add pc, 8       ; the cell size is 8 bytes
    mov w, [w]
    jmp [w]

It does three things:

1. It reads memory at PC and advances PC to the next cell. Remember, PC points to a memory cell that stores the execution token of a word.
2. It sets W to the execution token's value. In other words, after next executes, W stores the address of a pointer to the word's assembly implementation.
3. Finally, it jumps to the implementation code.

Each native word's implementation ends with a jmp next instruction, which guarantees that the next command will be fetched. To implement colon words, we need to use the return stack to save and restore PC before and after the call. While W is not of much use for native words, it is quite important for colon words. Let's take a look at docol, the common implementation of all colon words, shown in Listing 7-5. It also shows exit, another word designed to end all colon words.

Listing 7-5. forth_docol.asm
docol:
    sub rstack, 8
    mov [rstack], pc
    add w, 8
    mov pc, w
    jmp next

exit:
    mov pc, [rstack]
    add rstack, 8
    jmp next

docol saves PC on the return stack and sets the new PC to the first execution token stored in the current word. The return is performed by exit, which restores PC from the return stack. This mechanism is similar to a call/ret instruction pair.
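The trio next/docol/exit can be modeled in C with a union cell type: an execution token is the address of a cell holding the implementation address, exactly the two levels of indirection described above. The following compressed sketch is ours (all names are ours, and C function pointers stand in for assembly labels).

```c
typedef union cell cell;
typedef void (*impl_t)(void);

/* A cell holds either an execution token (the address of another
 * cell) or an implementation address (here, a C function). */
union cell {
    cell *xt;
    impl_t impl;
};

static cell *pc;                         /* next token to execute */
static cell *w;                          /* current execution token */
static long dstack[64]; static int dsp;  /* data stack */
static cell *rstack[64]; static int rsp; /* return stack */
static int running;

/* The inner interpreter, the analogue of `next`. */
static void next_step(void) {
    w = (pc++)->xt;  /* fetch the execution token, advance PC */
    w->impl();       /* jump through the cell it points to */
}

static void i_docol(void) { rstack[rsp++] = pc; pc = w + 1; }
static void i_exit(void)  { pc = rstack[--rsp]; }
static void i_dup(void)   { dstack[dsp] = dstack[dsp - 1]; dsp++; }
static void i_mul(void)   { dsp--; dstack[dsp - 1] *= dstack[dsp]; }
static void i_bye(void)   { running = 0; }

/* Native words: the execution token points to a single cell
 * holding the implementation address. */
static cell xt_dup[]  = { { .impl = i_dup } };
static cell xt_mul[]  = { { .impl = i_mul } };
static cell xt_exit[] = { { .impl = i_exit } };
static cell xt_bye[]  = { { .impl = i_bye } };

/* The colon word sq = `dup *`: docol first, then the body. */
static cell xt_sq[] = {
    { .impl = i_docol },
    { .xt = xt_dup }, { .xt = xt_mul }, { .xt = xt_exit }
};

/* The program `sq bye`. */
static cell program[] = { { .xt = xt_sq }, { .xt = xt_bye } };

/* Pushes x, runs the program, and returns the top of the stack. */
long run_square(long x) {
    dsp = rsp = 0; running = 1;
    dstack[dsp++] = x;
    pc = program;
    while (running)
        next_step();
    return dstack[dsp - 1];
}
```

Stepping through run_square shows the same dance as the assembly: docol pushes PC and enters the body of sq, and exit pops PC so execution resumes at the token after sq.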

■■Question 119 Read [32]. What is the difference between our approach (indirect threaded code) and direct threaded code and subroutine threaded code? What advantages and disadvantages can you name?

To better understand the concept of threaded code and the inner workings of Forth, we have prepared a minimal example, shown in Listing 7-6. It uses the routines developed in the first assignment in Section 2.7. Take your time to run it (the source code is shipped with the book) and check that it actually reads an input word and prints it back.

Listing 7-6. itc.asm
%include "lib.inc"
global _start

%define pc r15
%define w r14
%define rstack r13

section .bss
resq 1023
rstack_start: resq 1
input_buf: resb 1024

section .text
; this cell is the program
main_stub: dq xt_main

; The dictionary starts here.
; The first word is shown in full.
; We omit the flags and the links between
; nodes in the following words for brevity.
; Each word stores the address of its
; assembly implementation.

; Discards the top element of the stack
    dq 0            ; No previous node
    db "drop", 0
    db 0            ; Flags = 0
xt_drop:
    dq i_drop
i_drop:
    add rsp, 8
    jmp next

; Initializes the registers
xt_init:
    dq i_init
i_init:
    mov rstack, rstack_start
    mov pc, main_stub
    jmp next

; Saves PC when a colon word starts
xt_docol:
    dq i_docol
i_docol:
    sub rstack, 8
    mov [rstack], pc
    add w, 8
    mov pc, w
    jmp next

; Returns from a colon word
xt_exit:
    dq i_exit
i_exit:
    mov pc, [rstack]
    add rstack, 8
    jmp next

; Takes a buffer pointer from the stack;
; reads an input word and stores it
; starting at the given buffer
xt_word:
    dq i_word
i_word:
    pop rdi
    call read_word
    push rdx
    jmp next

; Takes a pointer to a string from the stack
; and prints it
xt_prints:
    dq i_prints
i_prints:
    pop rdi
    call print_string
    jmp next

; Exits the program
xt_bye:
    dq i_bye
i_bye:
    mov rax, 60
    xor rdi, rdi
    syscall

; Loads the predefined buffer address
xt_inbuf:
    dq i_inbuf
i_inbuf:
    push qword input_buf
    jmp next

; This is a colon word; it stores
; execution tokens. Each of them
; corresponds to a Forth word to be
; executed
xt_main:
    dq i_docol
    dq xt_inbuf
    dq xt_word
    dq xt_drop
    dq xt_inbuf
    dq xt_prints
    dq xt_bye

; The inner interpreter. These three lines
; fetch the next instruction and start its
; execution
next:
    mov w, [pc]
    add pc, 8
    jmp [w]

; The program starts execution from the init word
_start:
    jmp i_init

7.2.5 Compiler Forth can operate in either interpreter mode or compiler mode. The interpreter simply reads commands and executes them. When the word : (colon) is executed, Forth switches to compiler mode. Additionally, : reads the next word from the input and uses its name to create a new dictionary entry with docol as the implementation. Forth then reads words and, instead of executing them, appends their execution tokens to the body of the word being defined. So we have to add one more variable, which stores the address of the current position for writing words in compile mode; each write advances it by one cell. To exit compiler mode, we need special immediate words. They are executed no matter which mode we are in; without them we could never leave compiler mode. Immediate words are marked with an immediate flag. The interpreter pushes numbers onto the stack. The compiler cannot embed them directly into words; otherwise they would be treated as execution tokens. An attempt to execute a command using 42 as an execution token would most likely result in a segmentation fault. The solution is a special word lit followed by the number itself. The purpose of lit is to push the integer pointed to by PC and advance PC by one cell, so that PC never points at the embedded operand.

7.2.5.1 Forth Conditionals Let's single out two words in our Forth dialect: branch n and 0branch n. They are only allowed in compilation mode! They are similar to lit n in that their offset n is stored immediately after their execution token.

7.3 Task: Forth compiler and interpreter This section will describe one big task: writing your own Forth interpreter. Before you start, make sure you understand the basics of the Forth language. If you're not sure, you can play around with any free Forth interpreter, like gForth.

■■Question 120 Look up the documentation for sete, setl, and similar instructions. ■■Question 121 What does the cqo instruction do? See [15].

It is convenient to store PC and W in general-purpose registers, preferably those guaranteed to survive function calls unchanged (callee-saved): r13, r14, or r15.

7.3.1 Static Dictionary, Interpreter Let's start with a static dictionary of native words. Adapt the knowledge you gained in Section 5.4. For now, we cannot define new words at runtime. For this task we will use the following macro definitions:

• native, which takes three arguments:
–– The word name;
–– A part of the word identifier; and
–– Flags.

It creates and fills the header in .data and emits a label in .text. This label will mark the assembly code following the macro instance. Since most words do not use flags, we can overload native to accept either two or three arguments. To do so, we create a similar macro definition that accepts two arguments and calls the three-argument native, passing the first two arguments as-is and zero for the third, as shown in Listing 7-7.

Listing 7-7. native_overloading.asm
%macro native 2
native %1, %2, 0
%endmacro

Compare the two ways of defining the Forth dictionary: without macros (shown in Listing 7-8) and with them (shown in Listing 7-9).

Listing 7-8. forth_dict_example_nomacro.asm
section .data
w_plus:
    dq w_mul        ; previous
    db '+', 0
    db 0
xt_plus:
    dq plus_impl

section .text
plus_impl:
    pop rax
    add [rsp], rax
    jmp next

Listing 7-9. forth_dict_example_macro.asm
native '+', plus
    pop rax
    add [rsp], rax
    jmp next

Next, define a colon macro, analogous to the previous one. Listing 7-10 shows its usage.

Listing 7-10. forth_colon_usage.asm
colon '>', greater
    dq xt_swap
    dq xt_less
    dq xt_exit

Don't forget the docol address in every colon word! Next, create and test the following assembly routines:

• find_word, which accepts a pointer to a null-terminated string and returns the address of the start of the word header. If no word with that name exists, it returns zero.
• cfa (code field address), which takes the address of the start of a word header and skips the whole header, returning the address of its XT field.

Using these two routines and the ones you have already written in Section 2.7, you can write an interpreter loop. The interpreter either pushes a number onto the stack or fills the special stub shown in Listing 7-11. It should write the freshly found execution token into program_stub, point PC at the beginning of the stub, and jump to next. This will execute the word we have just parsed and then return control to the interpreter. Remember that an execution token is just the address of an address of assembly code. That is why the second cell of the stub points to the third one, and the third stores the address of the interpreter: we simply feed this data into the existing Forth machinery.

Listing 7-11. forth_program_stub.asm
program_stub: dq 0
xt_interpreter: dq .interpreter
.interpreter: dq interpreter_loop

Figure 7-11 shows pseudocode illustrating the interpreter logic.

Figure 7-11. Forth interpreter: pseudocode Remember that the Forth machine also has memory. Let's pre-allocate 65536 Forth cells to it.

■■Question 122 Should we allocate these cells in the .data section, or are there better options? For Forth to know where its memory is, let's create a word mem, which simply pushes the starting address of this memory onto the stack.

7.3.1.1 List of Words You must first create an interpreter that supports the following words:

• .S: prints the entire contents of the stack without changing it. To implement it, save rsp before starting the interpreter.
• Arithmetic: + - * /, =

if (x >= 3) {
    puts("X is greater than 3");
} else {
    puts("X is less than 3");
}

Braces are optional. Without braces, only one statement is considered part of each branch, as shown in Listing 8-6.

Listing 8-6. if_no_braces.c
if (x == 0) puts("X is zero");
else puts("X is not zero");

Note that there is a syntactic ambiguity called the "dangling else." Check Listing 8-7 and see whether you can say with certainty which if the else branch belongs to, the first or the second. To resolve this ambiguity in the case of nested ifs, use braces.

Listing 8-7. dangling_else.c
if (x == 0)
if (y == 0) { puts("A"); }
else { puts("B"); }

/* You may have meant one of the following interpretations.
 * The compiler may issue a warning to prevent this */
if (x == 0) {
    if (y == 0) { puts("A"); }
    else { puts("B"); }
}

if (x == 0) {
    if (y == 0) { puts("A"); }
} else {
    puts("B");
}


Chapter 8 ■ Basics

8.3.2 while A while statement is used to make a loop, as in Listing 8-8.

Listing 8-8. while_example.c
int x = 10;
while ( x != 0 ) {
    puts("Hello");
    x = x - 1;
}

If the condition holds, the body is executed. Then the condition is checked once more, and if it is still true, the body is executed again, and so on. The alternative form do body while (condition); checks the condition after executing the loop body, thus guaranteeing at least one iteration. Listing 8-9 shows an example. Note that a body can be empty, as in while (x == 0);. The semicolon after the parentheses ends the statement.

Listing 8-9. do_while_example.c
int x = 10;
do {
    printf("Hello\n");
    x = x - 1;
} while ( x != 0 );

8.3.3 for A for loop is ideal for iterating over finite collections, such as linked lists or arrays. It has the form for (initializer; condition; step) body. Listing 8-10 shows an example.

Listing 8-10. for_example.c
int a[] = {1, 2, 3, 4}; /* an array of 4 elements */
int i = 0;
for ( i = 0; i < 4; i++ ) {
    printf( "%d", a[i] );
}

First, the initializer is executed. Then the condition is checked, and if it is true, the body of the loop is executed, followed by the step statement. In this case, the step statement is the increment operator ++, which increases the variable's value by one. After that, the loop starts over with the condition check, and so on. Listing 8-11 shows two equivalent loops.

Listing 8-11. while_for_equiv.c
int i;

/* as a `while` loop */
i = 0;
while (i < 10) {
    puts("Hello!");
    i = i + 1;
}

/* as a `for` loop */
for( i = 0; i < 10; i = i + 1 ) {
    puts("Hello!");
}

The break statement is used to end the loop prematurely and fall through to the next statement in the code. continue ends the current iteration and starts the next one immediately. Listing 8-12 shows an example: the loop skips odd numbers and prints only the even ones.

Listing 8-12. loop_count.c
int n = 0;
for( n = 0; n < 20; n++ ) {
    if (n % 2) continue;
    printf("%d is even\n", n);
}

Also note that in a for loop the initializer, condition, and step expressions can be left empty. Listing 8-13 shows an example.

Listing 8-13. infinite_for.c
for( ; ; ) {
    /* this loop will repeat indefinitely unless `break` is issued in its body */
    break; /* `break` is here, so we stop iterating */
}

8.3.4 goto A goto statement allows you to jump to a label within the same function. As in assembly, labels can mark any statement, and the syntax is the same: label: statement. goto is often described as bad coding style; however, it can be quite useful when coding finite state machines. What you should not do is abandon conditionals and well-thought-out loops in favor of goto spaghetti. The goto statement is sometimes used to break out of multiple nested loops. However, this is often a symptom of bad design, because inner loops can be abstracted into a function (thanks to compiler optimizations, probably at no runtime cost). Listing 8-14 shows how to use goto to exit all inner loops. Listing 8-14. goto.c int i; int j; for (i = 0; i < 100; i++ )


for( j = 0; j < 100; j++ ) { if (i * j == 432) goto end; else printf("%d * %d != 432\n", i, j ); } end: The goto statement combined with the imperative style makes analyzing the program's behavior more difficult for both humans and machines (compilers), so the fancy optimizations that modern compilers are capable of become less likely, and the code becomes harder to maintain. We recommend restricting the use of goto to code snippets that do not perform assignments, such as finite state machine implementations. This way you won't have to trace all the possible execution paths of the program and how the values of certain variables change as the program executes one way or another.

8.3.5 switch A switch statement is used as a shortcut for multiple nested ifs when the condition is that some integer variable is equal to one value or another. Listing 8-15 shows an example. Listing 8-15. case_example.c int i = 10; switch ( i ) { case 1: /* if i equals 1...*/ puts( "It's one" ); break; /* break is mandatory */ case 2: /* if i equals 2...*/ puts( "It's two" ); break; default: /* otherwise... */ puts("Neither one nor two"); break; } Each case is actually a label. The cases are not delimited by anything except an optional break statement, which exits the switch block. This allows for some interesting tricks.1 However, a forgotten break is often a source of bugs. Listing 8-16 demonstrates both behaviors: first, multiple labels are attached to the same statement, which means that no matter whether x is 0, 1, or 10, the executed code will be the same. Then, since no break terminates this case, after the execution of the first printf, control will fall through to the next statement, labeled case 15, another printf. Listing 8-16. case_magic.c switch ( x ) { case 0: case 1: case 10: puts( "First case: x = 0, 1, or 10" ); 1 One of the most well-known hacks is called Duff's device, and it involves a loop defined inside a switch that contains several case labels.


/* Note the absence of `break`! */ case 15: puts( "Second case: x = 0, 1, 10 or 15" ); break; }

8.3.6 Example: Divisor Listing 8-17 shows a program that looks for the first divisor of a number, which is then printed to stdout. The first_divisor function takes an argument n and finds an integer r from 1 exclusive to n inclusive, such that n is a multiple of r. If r = n, we have obviously found a prime number. Notice how the statement after the for is not enclosed in braces, because it is the only statement inside the loop. The same holds for the if body, which consists of a single return i. Of course, you can enclose them in braces, and some programmers recommend that. Listing 8-17. divisor.c #include <stdio.h> int first_divisor( int n ) { int i; if ( n == 1 ) return 1; for( i = 2; i <= n; i++ ) if ( n % i == 0 ) return i; return 0; }

–– Bitwise operators: ~ ^ & | –– Assignment operators: = += -= *= /= %= <<= >>= &= ^= |= –– Miscellaneous operators: 1. sizeof (var), read as "replace this with the size of var in bytes" 2. &, read as "take the address of the operand" 3. *, read as "dereference this pointer" 4. ?:, the ternary operator we talked about earlier 5. ->, which is used to refer to a field of a structure or union type.


Most operators have an obvious meaning. We will mention some of the lesser used and more obscure ones. • The increment and decrement operators can be used in prefix or postfix form: for a variable i, either i++ or ++i. Both expressions have an immediate effect on i: it is incremented by 1. However, the value of i++ is the "old" i, while the value of ++i is the "new", incremented i. • There is a difference between logical operators and bitwise operators. For logical operators, any number other than zero has essentially the same meaning, while bitwise operations apply to each bit separately. For example, 2 & 4 equals zero, because 2 and 4 have no set bits in common. However, 2 && 4 will return 1, because both 2 and 4 are non-zero numbers (true values). • Logical operators are evaluated lazily. Consider the logical operator &&. When applied to two expressions, the first expression is evaluated. If its value is zero, the computation ends immediately, due to the nature of the AND operation: if either operand is zero, the result of the whole conjunction will also be zero, so there is no need to evaluate further. This matters because the behavior is observable. Listing 8-24 shows an example where the program outputs F and never executes the function g. Listing 8-24. logic_lazy.c #include <stdio.h> int f(void) { puts( "F" ); return 0; } int g(void) { puts("G"); return 1; } int main(void) { f() && g(); return 0; } • Tilde (~) is the bitwise unary negation; caret (^) is the bitwise binary xor. In the following chapters, we will review some of these operators, such as those used for address manipulation and operand sizes.

8.5 Functions We can draw a line between procedures (which do not return a value) and functions (which return a value of a certain type). A procedure call cannot be embedded into a more complex expression, unlike a function call. Listing 8-25 shows an exemplary procedure. Its name is myproc; its return type is void, so it returns nothing. It accepts two integer parameters called a and b. Listing 8-25. proc_example.c void myproc ( int a, int b ) { printf("%d", a+b); }


Listing 8-26 shows an example function. It accepts two arguments and returns a value of type int. A call to this function is later used as part of a more complex expression. Listing 8-26. function_example.c int myfunc ( int a, int b ) { return a + b; } int other( int x ) { return 1 + myfunc( 4, 5 ); } The execution of each function should end with a return statement; otherwise, the returned value is undefined. Procedures can omit the return keyword; it can still be used without an operand to return from the procedure immediately. When there are no arguments, the void keyword must be used in the function declaration, as shown in Listing 8-27. Listing 8-27. no_arguments_ex.c int always_return_0( void ) { return 0; } The body of a function is a block statement, so it is enclosed in braces and does not end with a semicolon. Each block defines a lexical scope for variables. All variables must be declared at the beginning of the block, before any statements. This restriction is present in C89 but not in C99. We are sticking with it to make the code more portable. Additionally, it imposes a certain self-discipline: if a large number of local variables is declared at the beginning of a scope, the code looks cluttered, which is usually a sign of poor program decomposition and/or a poor choice of data structures. Listing 8-28 shows examples of good and bad variable declarations. Listing 8-28. block_variables.c /* Good */ void f(void) { int x; ... } /* Incorrect: `x` is declared after calling `printf` */ void f(void) { int y = 12; printf("%d", y); int x = 10; ... } /* Incorrect: `i` cannot be declared in the `for` initializer */ for( int i = 0; i < 10; i++ ) { ... }


/* Good: `i` is declared before `for` */ int f(void) { int i; for( i = 0; i < 10; i++ ) { ... } } /* Good: any block can have additional variables declared at its beginning */ /* `x` is local to a `for` iteration and is always reset to 10 */ for( i = 0; i < 10; i++ ) { int x = 10; } If a variable in a given scope has the same name as a variable already declared in an enclosing scope, the newer variable hides the old one. There is no way to refer to the hidden variable syntactically (short of storing its address somewhere and using that address). Of course, local variables in different functions can have the same names.

■■Note  Variables are visible until the end of their respective blocks. So the commonly used notion of 'local' variables actually means block-local, not function-local. The rule of thumb is: make variables as local as possible (including variables local to loop bodies, for example). It considerably reduces program complexity, especially in large projects.

8.6 Preprocessor The C preprocessor acts similarly to the NASM preprocessor. Its power, however, is much more limited. The most important preprocessor directives you will see are: • #define • #include • #ifndef • #endif The #define directive is very similar to its NASM equivalent %define. It has three main uses. • Defining global constants (see Listing 8-29 for an example). Listing 8-29. define_example1.c #define MY_CONST_VALUE 42 • Defining parameterized macro substitutions (as shown in Listing 8-30).


Listing 8-30. define_example2.c #define MACRO_SQUARE( x ) ((x) * (x)) • Defining flags; depending on them, some additional code can be included in or excluded from the sources. It is important to enclose all occurrences of arguments in macro definitions in parentheses. The reason is that C macros are not syntactic, which means that the preprocessor knows nothing about the structure of the code. This sometimes results in unexpected behavior, as shown in Listing 8-31. Listing 8-32 shows the preprocessed code. Listing 8-31. define_parentheses.c #define SQUARE( x ) (x * x) int x = SQUARE( 4+1 ) As you can see, the value of x will not be 25 but 9, that is, 4+(1*4)+1, because multiplication has a higher priority than addition. Listing 8-32. define_parentheses_preprocessed.c int x = 4+1 * 4+1 The #include directive pastes the contents of the given file in place. The file name is enclosed either in quotes (#include "file.h") or in angle brackets (#include <file.h>). • In the case of angle brackets, the file is searched for in a set of predefined directories. For GCC, this is usually: –– /usr/local/include –– <libdir>/gcc/<target>/<version>/include Here <libdir> stands for the directory containing the libraries (a GCC setting); it is usually /usr/lib or /usr/local/lib by default. –– /usr/<target>/include –– /usr/include

Using the -I flag, you can add directories to this list. You can create a special include/ directory in the root of your project and add it to GCC's include search list.

• In the case of quotes, the file is also searched for in the current directory. You can get the preprocessor output by running gcc -E filename.c, in the same way as when working with NASM. This will execute all the preprocessor directives and dump the result to stdout without actually compiling anything.


8.7 Summary In this chapter we covered the basic concepts of C. In C, all variables are labels for memory locations of an abstract machine whose architecture closely resembles the von Neumann architecture. After describing the universal program structure (functions, data types, global variables, . . . ), we defined two syntactic categories: statements and expressions. We saw that expressions are lvalues or rvalues, and we learned how to control program execution using function calls and control statements such as if and while. We are already able to write simple programs that perform computations with integers. In the next chapter, we will look at the C type system and types in general to get a broader idea of how types are used in different programming languages. Thanks to the notion of arrays, our possible input and output data will be much more diverse.

■■Question 148  What is a literal? ■■Question 149  What are lvalue and rvalue? ■■Question 150  What is the difference between statements and expressions? ■■Question 151  What is a statement block? ■■Question 152 How do you define a preprocessor symbol? ■■Question 153  Why is it usually necessary to write break at the end of each switch case? ■■Question 154 How are true and false values encoded in C89? ■■Question 155  What is the first argument of the printf function? ■■Question 156 Does printf check the types of its arguments? ■■Question 157  Where can you declare variables in C89?


CHAPTER 9

Type System The notion of type is one of the key ones in programming. A type is essentially a tag assigned to a data entity. Every data transformation is defined for specific data types, which guarantees that it makes sense (you wouldn't want to add the number of active Reddit users to the average midday temperature in the Sahara). This chapter studies the C type system in depth.

9.1 Basic C Type System All types in C fall into one of these categories: • Predefined numeric types (int, char, float, etc.). • Arrays: multiple elements of the same type occupying consecutive memory cells. • Pointers, which are essentially cells that store the addresses of other cells. The pointer type encodes the type of the cell it points to. A particular case of pointers is function pointers. • Structures, which are packages of data of different types. For example, a structure can store an integer and a floating-point number. Each of the data elements has its own name. • Enumerations, which are essentially integers taking one of several explicitly defined values. Each of these values has a symbolic name to refer to it. • Function types. • Constant types, built on top of some other type and making the data immutable. • Type aliases for other types.

9.1.1 Numeric Types The most basic C types are numeric. They differ in size and in being signed or unsigned. Due to a long and loosely controlled evolution of the language, their description can sometimes seem cryptic and often quite ad hoc. The following is a list of the basic types: 1. char • Can be signed or unsigned. By default it is usually signed, but this is not required by the language standard. • Its size is always 1 byte;

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_9


Chapter 9 ■ Type system

• Although the name refers directly to the word "character," char is an integer type and should be treated as such. It is usually used to store the ASCII code of a character, but it can hold any 1-byte number. • A literal 'x' corresponds to the ASCII code of the character "x". Its type is int, but it is safe to assign it to a variable of type char.1 Listing 9-1 shows an example. Listing 9-1. char_example.c char number = 5; char symbol_code = 'x'; char null_terminator = '\0'; 2. int • An integer number. • Can be signed or unsigned. It is signed by default. • Has aliases: signed, signed int (and similarly for unsigned). • Can be short (2 bytes) or long (4 bytes on 32-bit architectures, 8 bytes on Intel 64). Most compilers also support long long, but until C99 it was not part of the standard. • Other aliases: short, short int, signed short, signed short int. • The size of int without modifiers varies by architecture. It was designed to be equal to the machine word size. In the 16-bit era the size of int was, obviously, 2 bytes; on 32-bit machines it is 4 bytes. Unfortunately, that has not stopped programmers from relying on an int of size 4 in the age of 32-bit computing. Because of the large amount of software that would break if the size of int changed, its size remains 4 bytes. • It is important to note that all integer literals are of type int by default. By adding the L or UL suffix, we explicitly state that a number is of type long int or unsigned long int. Sometimes it is very important not to forget these suffixes: the literal 1 is an int, so an expression such as 1 << 48 is undefined where int is 32 bits wide, while 1L << 48 computes the expected value.

if ( max - cur > e ) return 1; else return 0; } What happens if cur > max? It implies that the difference max - cur is negative. Its type is ptrdiff_t. Comparing it with a value of type unsigned int is an interesting case to study. ptrdiff_t has as many bits as an address on the target architecture.
Let's study two cases: • A 32-bit system, where sizeof( unsigned int ) == 4 and sizeof( ptrdiff_t ) == 4. In this case, the operands of our comparison undergo the usual conversions: the ptrdiff_t value is converted to unsigned int, so a negative difference turns into a huge positive number and the comparison produces the wrong result.

array->size = size; memset( array->array, 0, size ); return array; }

11.3.1 Syntax Details C allows us to define multiple variables in one line: int a, b = 4, c; However, to declare multiple pointers, you must add an asterisk before each pointer name. Listing 11-14 shows an example: a and b are pointers, but the type of c is int. Listing 11-14. ptr_mult_decl.c int* a, *b, c; This rule can be circumvented by creating a type alias for int* using typedef, hiding the asterisk. Defining multiple variables on one line is generally discouraged, as it makes the code harder to read in most cases. It is possible to create quite complex type definitions by mixing function pointers, arrays, pointers, etc. You can use the following algorithm to decipher them: 1. Find the identifier and start from there. 2. Go right to the first closing parenthesis. Find its match on the left. Interpret the expression between these parentheses. 3. Move up one level relative to the expression analyzed in the previous step. Find the enclosing parentheses and repeat step 2.


Chapter 11 ■ Memory

We'll illustrate this algorithm with the example shown in Listing 11-15. Table 11-1 describes the analysis process. Listing 11-15. complex_decl_1.c int* (* (*fp) (int) ) [10]; Table 11-1. Parsing a complex declaration

Expression                     Interpretation
fp                             The first identifier.
(*fp)                          It is a pointer...
(* (*fp) (int))                ...to a function that accepts int and returns a pointer...
int* (* (*fp) (int)) [10]      ...to an array of ten pointers to int.

As you can see, deciphering complex declarations is not an easy process. It can be simplified by using typedefs for parts of the declaration.

11.4 String Literals Any sequence of char elements terminated by a null terminator can be seen as a string in C. Here, however, we want to talk about strings encoded directly in the program text, that is, string literals. Most string literals are stored in .rodata, if they are long enough. Listing 11-16 shows an example of a string literal. Listing 11-16. str_lit_example.c char* str = "when the music ends, turn off the lights"; str is just a pointer to the first character of the string. According to the language standard, string literals (or rather the data they point to) cannot be changed.1 Listing 11-17 shows an example. Listing 11-17. string_literal_mut.c char* str = "hello world abcdefghijkl"; /* the following line produces a runtime error */ str[15] = '\''; In C++, string literals are of type char const* by default, reflecting their immutable nature. Consider using variables of type char const* whenever you can, when the strings you are dealing with must not mutate. The constructs shown in Listing 11-18 are also correct, although you will probably never use the second one.

1 To be precise, the result of such an operation is not well defined.


Listing 11-18. str_lit_ptr_ex.c char will_be_o = "hello world!"[4]; /* is 'o' */ char const* tail = "abcde" + 3; /* is "de", skipping 3 symbols */ When manipulating strings, there are several common scenarios based on where the string is allocated. 1. We can create a string in a global variable. It will be mutable and will under no circumstances be duplicated in the constant data region. Listing 11-19 shows an example. Listing 11-19. str_glob.c char str[] = "global_something"; void f (void) { ... } In other words, it is just a global array initialized with character codes. 2. We can create a string on the stack, in a local variable. Listing 11-20 shows an example. Listing 11-20. str_loc.c void func(void) { char str[] = "some_local"; } However, the "some_local" string itself must be kept somewhere, because local variables are initialized every time the function starts, and we need to know the values they should be initialized with. For relatively short strings, the compiler will try to inline them into the instruction stream: it is smarter to just split a small string into 8-byte chunks and emit mov instructions with each chunk as an immediate operand. Long strings, however, are better stored in .rodata. The declaration shown in Listing 11-20 will allocate enough bytes on the stack and then copy the string data from its read-only location into this local stack buffer. 3. We can dynamically allocate a string via malloc. The string.h header file contains some very useful functions, such as memcpy, which is used for fast copying. Listing 11-21 shows an example. Listing 11-21. str_malloc.c #include <stdlib.h> #include <string.h> int main( int argc, char** argv ) { char* str = (char*)malloc( 25 ); strcpy( str, "wow, what a nice string!" ); free( str ); return 0; }


■■Question 210  Why do we allocate 25 bytes for a 24-character string? ■■Question 211 Read man for the functions memcpy, memset, and strcpy.

11.4.1 String Interning "String interning" is a term most commonly used by Java or C# programmers. However, something similar happens in C (but only at compile time). The compiler tries to avoid duplicating strings in the read-only data region. This means that, in general, the same address will be assigned to all three variables in the code shown in Listing 11-22. Listing 11-22. str_intern.c char* best_guitar_solo = "Firth of Fifth"; char* good_genesis_song = "Firth of Fifth"; char* best_1973_live = "Firth of Fifth"; String interning would be impossible if string literals were not write-protected. Otherwise, by changing such a string in one place in a program, we would introduce an unpredictable change to data used elsewhere, since both places share the same copy of the string.

11.5 Data Models We have talked about the sizes of different integer types. The language standard imposes a set of rules such as "the size of long is not smaller than the size of short" or "the size of signed short must be such that it can hold values in the range −2^15 … 2^15 − 1." The latter rule, however, does not give us a fixed size, because short could be 8 bytes wide and still satisfy this restriction, so these requirements are far from setting the exact sizes in stone. To systematize the different sets of sizes, conventions called data models were created. Each of them defines the sizes of the basic types. Figure 11-2 shows some notable data models that might be of interest to us.

Figure 11-2. Data models


As we chose a 64-bit GNU/Linux system for study purposes, our data model is LP64. When developing for a 64-bit Windows system, the size of long will be different. Everyone wants to write portable code that can be reused across platforms, and luckily there is a standard way to never be bitten by data model changes. Prior to C99, it was common practice to create a set of type aliases of the form int32 or uint64 and use them exclusively throughout the program, rather than the varying ints and longs. If the target architecture changed, the type aliases were easy to fix. However, this created chaos, because everyone made their own set of types. C99 introduced platform-independent types. To use them, just include the stdint.h header. It gives access to various fixed-length integer types. Each of their names has the form: • u, if the type is unsigned; • int; • the size in bits: 8, 16, 32, or 64; • _t. For example: uint8_t, int64_t, int16_t. The printf family of functions (and, similarly, formatted input) received a similar treatment with the introduction of special macros that select the correct format specifiers. They are defined in the inttypes.h file. In the common cases, you want to read or write integers or pointers. Then the name of the macro is formed as follows: • PRI for output (printf, fprintf, etc.) or SCN for input (scanf, fscanf, etc.). • A format specifier: –– d for decimal format. –– x for hexadecimal format. –– o for octal format. –– u for unsigned integer format. –– i for integer format. • Additional information, one of the following: –– N for N-bit integers. –– PTR for pointers. –– MAX for the maximum supported bit size. –– FASTN for the fastest implementation-defined type of at least N bits. These macros exploit the fact that adjacent string literals, delimited by spaces, are automatically concatenated. Each macro expands to a string containing the correct format specifier, which is concatenated with whatever surrounds it. Listing 11-23 shows an example.


Listing 11-23. inttypes.c #include <inttypes.h> #include <stdio.h> void f( void ) { int64_t i64 = -10; uint64_t u64 = 100; printf( "64-bit signed integer: %" PRIi64 "\n", i64 ); printf( "64-bit unsigned integer: %" PRIu64 "\n", u64 ); } See section 7.8.1 of [7] for a complete list of these macros.

11.6 Data Streams The standard C library gives us a platform-independent way of working with files. It abstracts files as streams of data that we can read from and write to. We have seen how files are handled in Linux at the system call level: the open system call opens a file and returns its descriptor (an integer), the write and read system calls perform writing and reading, respectively, and the close system call ensures that the file is closed properly. Since the C language was created alongside the Unix operating system, the two share the same approach to file interaction. The library counterparts of these functions are named fopen, fwrite, fread, and fclose. On Unix-like systems, they act as adapters for the system calls, providing similar functionality, except that they work the same way on other platforms. The main differences are as follows: 1. Instead of file descriptors, we use a special type FILE, which stores all the information about a given stream. Its implementation is hidden, and you should never change its internal state manually. So, instead of working with numeric file descriptors (which are platform dependent), we use FILE as a black box. A FILE instance is allocated on the heap internally by the C library itself, so at any given time we work with a pointer to it rather than with the instance itself directly. 2. Although Unix file operations are more or less uniform, there are two types of streams in C. • Binary streams consist of raw bytes that are treated "as is." • Text streams contain symbols grouped into lines; each line ends with an end-of-line character (implementation dependent). Text streams are limited in various ways on some systems: • The line length can be limited. • They may only work with printing characters, newlines, spaces, and tabs. • Spaces before a newline can disappear.
On some operating systems, text and binary streams use different file formats, so to work with a text file in a way that is compatible between all your programs, the use of text streams is mandatory. While the GNU C library, usually associated with GCC, does not make any difference between binary and text streams, on other platforms this is not the case, so it is crucial to distinguish between them.


For example, I once saw a situation where reading a large chunk of an image file on Windows (the compiler was MSVC) terminated prematurely, because the image was obviously binary while the associated stream had been created in text mode. The standard library provides machinery for creating and working with streams. Some of the functions it defines should be used only on text streams (such as fscanf). The relevant header file is called stdio.h. Let's look at the example shown in Listing 11-24. Listing 11-24. file_example.c int smth[] = {1,2,3,4,5}; FILE* f = fopen( "hello.img", "w+b" ); fread( smth, sizeof( int ), 1, f ); /* This line is optional. Using the `fseek` function we can navigate the file */ fseek( f, 0, SEEK_SET ); fwrite( smth, 5 * sizeof( int ), 1, f ); fclose( f ); • A FILE instance is created by a call to the fopen function. The latter accepts the path to the file and a set of flags packed into a string. The important fopen flags are listed here. –– b — opens the file in binary mode. This is what makes the real distinction between text and binary streams. By default, files are opened in text mode. –– w — opens a stream with the ability to write to it. –– r — opens a stream with the ability to read from it. –– + — opens a stream for update, that is, both reading and writing. With w, an existing file's contents are discarded, and the file is created if it does not exist. The hello.img file is opened in binary mode for reading and writing; the file's contents will be replaced. • Once created, the FILE holds a kind of pointer to a position within the file, a cursor of sorts. Reads and writes move this cursor forward. • The fseek function is used to move the cursor without reading or writing. It lets you move the cursor relative to its current position or to the beginning of the file. • The fwrite and fread functions are used to write data to and read data from the open FILE instance. Taking fread as an example, it accepts the memory buffer to read into.
The two integer parameters are the size of an individual block and the number of blocks to read. The return value is the number of blocks successfully read from the file. Reading each block is atomic: it is either read completely or not read at all. In this example, the block size is sizeof( int ) and the number of blocks is one. The use of fwrite is symmetrical. • fclose should be called when the work with the file is complete.


There is a special constant EOF. When returned by a function that works with a file, it means that the end of the file has been reached. Another constant, BUFSIZ, stores the buffer size that works best in the current environment for input and output operations. Streams can use buffering, which means they have an internal buffer that mediates all reads and writes. It makes system calls (which are costly in terms of performance due to context switching) rarer: a write triggers an actual write system call only occasionally, for example, when the buffer is full. A buffer can be flushed manually using the fflush function: any delayed writes are executed and the buffer is reset. When the program starts, three FILE* instances are created and attached to the streams with descriptors 0, 1, and 2. They are named stdin, stdout, and stderr. All three normally use a buffer, but stderr automatically flushes the buffer after each write, which is necessary so that error messages are not delayed or lost.

■■Note Again, descriptors are integers, FILE instances are not. The function int fileno( FILE* stream ) is used to get the underlying descriptor of the file stream.

■■Question 212 Read man for the functions fread, fwrite, fprintf, fscanf, fopen, fclose, and fflush. ■■Question 213  Investigate what happens when the fflush function is applied to a bidirectional stream (opened for both reading and writing) if the last action performed on the stream was a read.

11.7 Task: Higher-Order Functions and Lists 11.7.1 Common Higher-Order Functions In this task, we are going to implement several higher-order functions on linked lists, which should be familiar to those who have used the functional programming paradigm. These functions are known by the names foreach, map, map_mut, and foldl. • foreach accepts a pointer to the beginning of a list and a function (which returns void and accepts an int). It calls the function on each element of the list. • map accepts a function f and a list. It returns a new list containing the results of applying f to all elements of the source list. The source list is not affected. For example, f(x) = x + 1 will map the list (1, 2, 3) to (2, 3, 4). • map_mut does the same but changes the source list. • foldl is a bit more complicated. It accepts: –– The initial value of the accumulator. –– A function f(x, a). –– A list of elements. It returns a value of the same type as the accumulator, computed as follows: 1. We apply f to the accumulator and the first element of the list. The result is the new accumulator value a′. 2. We apply f to a′ and the second element of the list. The result is again a new accumulator value a″.


Chapter 11 ■ Memory

3. We repeat the process until the whole list is consumed. The final accumulator value is the result. For example, consider f(x, a) = x * a. Starting foldl with the accumulator value 1 and this function, we will compute the product of all elements of the list.
• iterate accepts an initial value s, the length of the list n, and a function f. It then generates a list of length n as follows:

(s, f(s), f(f(s)), f(f(f(s))), …)

The functions described above are called higher-order functions, because they accept other functions as arguments. Another example of such a function is the standard array sorting function qsort:

void qsort( void *base, size_t nmemb, size_t size, int (*compar)(const void *, const void *));

It accepts the starting address of the array base, the number of elements nmemb, the size of an individual element size, and the comparison function compar. The comparison function decides which of two given elements should be closer to the beginning of the array.

■■Question 214 Read man qsort.

11.7.2 Assignment

The input contains an arbitrary number of integers.

1. Store these integers in a linked list.
2. Move all the list functions written in the previous assignment into separate .h and .c files. Do not forget the include guard!
3. Implement foreach; using it, print the initial list to stdout twice: the first time, separate the elements with spaces; the second time, print each element on a new line.
4. Implement map; using it, print the squares and cubes of the numbers in the list.
5. Implement foldl; using it, print the sum and the minimum and maximum elements of the list.
6. Implement map_mut; using it, print the absolute values of the input numbers.
7. Implement iterate; using it, create and print the list of powers of two (first 10 values: 1, 2, 4, 8, …).
8. Implement a function bool save(struct list* lst, const char* filename); which will write all list elements to a text file filename. It should return true if the write succeeded, false otherwise.
9. Implement a function bool load(struct list** lst, const char* filename); which will read all integers from a text file and store the resulting list in *lst. It should return true if the read succeeded, false otherwise.



10. Save the list to a text file and load it back using the two functions above. Check that saving and loading work correctly.
11. Implement a function bool serialize(struct list* lst, const char* filename); which will write all list elements to a binary file filename. It should return true if the write succeeded, false otherwise.
12. Implement a function bool deserialize(struct list** lst, const char* filename); which will read all integers from a binary file filename and store the resulting list in *lst. It should return true if the read succeeded, false otherwise.
13. Serialize the list into a binary file and load it back using the two functions above. Check that serialization and deserialization work correctly.
14. Free all allocated memory.

You will need to learn how to use

• Function pointers.
• limits.h and its constants. For example, to find the minimum element of a list, you would use foldl with the maximum possible int value as the initial accumulator and a function that returns the minimum of its two arguments.
• The static keyword for functions that you only want to use in one module.

You are guaranteed that

• The input stream contains only integers separated by whitespace.
• All input numbers fit into an int.

It is probably convenient to write a separate function to read a list from a FILE*. The solution takes about 150 lines of code, not counting the functions defined in the previous assignment.

■■Question 215  In languages such as C#, code like the following is possible:

var count = 0;
mylist.Foreach( x => count += 1 );

Here, we run an anonymous function (that is, a function that has no name, but whose address can be manipulated, for example, passed to another function) for each element of a list. The function is written as x => count += 1 and is equivalent to

void no_name( int x ) { count += 1; }

What is interesting is that this function knows about some of the caller's local variables and can therefore modify them. Can you rewrite the foreach function so that it accepts a pointer to some kind of "context", which might contain an arbitrary number of variable addresses, and then passes this context to the function called for each element?



11.8 Summary

In this chapter we studied the memory model. We gained a better understanding of type sizes and data models, studied pointer arithmetic, and learned how to decipher complex type declarations. In addition, we saw how to use standard library functions to perform input and output. We practiced all this by implementing several higher-order functions and doing some file input and output. We will further deepen our understanding of how programs work with memory in the next chapter, where we will detail the difference between the three "facets" of a language (syntax, semantics, and pragmatics), explore the notions of undefined and unspecified behavior, and show why data alignment is important.

■■Question 216  What arithmetic operations can you perform with pointers and under what conditions?
■■Question 217  What is the purpose of void*?
■■Question 218  What is the purpose of NULL?
■■Question 219  What is the difference between 0 in a pointer context and 0 as an integer value?
■■Question 220  What is ptrdiff_t and how is it used?
■■Question 221  What is the difference between size_t and ptrdiff_t?
■■Question 222  What are first-class objects?
■■Question 223  Are functions first-class objects in C?
■■Question 224  What data regions does the C abstract machine contain?
■■Question 225  Is the constant data region usually write-protected in hardware?
■■Question 226  What is the connection between pointers and arrays?
■■Question 227  What is dynamic memory allocation?
■■Question 228  What is the sizeof operator? When is it computed?
■■Question 229  When are string literals stored in .rodata?
■■Question 230  What is string interning?
■■Question 231  Which data model are we using?
■■Question 232  Which header contains platform-independent types?
■■Question 233  How can we concatenate string literals at compile time?
■■Question 234  What is a data stream?
■■Question 235  Is there a difference between a data stream and a descriptor?
■■Question 236  How do we get the descriptor of a stream?
■■Question 237  Are there streams that are open when the program starts?
■■Question 238  What is the difference between binary and text streams?
■■Question 239  How do we open a binary stream? A text stream?

CHAPTER 12

Syntax, Semantics, and Pragmatics

In this chapter, we will review the very essence of what a programming language is. These fundamentals will allow us to better understand the structure of a language, program behavior, and the translation details that must be taken into account.

12.1 What Is a Programming Language?

A programming language is a formal computer language designed to describe algorithms in a machine-understandable way. Each program is a sequence of characters. But how do we distinguish programs from all other strings? We need to define the language somehow. The crude way is to say that the compiler itself defines the language by parsing programs and translating them into executable code. This approach is bad for several reasons. What do we do with compiler bugs? Are they really bugs, or do they become part of the language definition? How do we write other compilers? Why should we mix the language definition with implementation details? Another way is to provide a cleaner, implementation-independent description of the language. It is quite common to distinguish three facets of the same language.

• The rules for constructing statements. The description of correctly structured programs is frequently done through formal grammars. These rules form the syntax of the language.
• The effects of each language construct on the abstract machine. This is the semantics of the language.
• In any language there is also a third aspect, called pragmatics. It describes the influence of the real-world implementation on program behavior.
–– In some situations, the language standard does not provide enough information about program behavior. It is then entirely up to the compiler to decide how it translates such a program, which is why the compiler often assigns specific behavior to these programs. For example, in the call f(g(x), h(x)) the order of evaluation of g(x) and h(x) is not defined by the standard. We can compute g(x) first and then h(x), or vice versa. But the compiler will pick a certain order and generate instructions that perform the calls in exactly that order.
–– Sometimes there are different ways to translate language constructs into target code.
For example, do we want to prohibit the compiler from making certain choices, or do we stick with the laissez-faire approach? In this chapter we will explore these three facets of languages and apply them to C.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_12


Chapter 12 ■ Syntax, Semantics, and Pragmatics

12.2 Syntax and Formal Grammars

First of all, a language is a subset of all possible strings that can be constructed from a given alphabet. For example, a language of arithmetic expressions has the alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, −, ×, /, .}, assuming only these four operations are used and the dot separates the integer part. Not all combinations of these symbols form a valid string; for example, +++++ is not a valid sentence of this language. Formal grammars were first formalized by Noam Chomsky. They were created in an attempt to formalize natural languages such as English. In this view, sentences have a tree-like structure, where the leaves are a kind of "basic blocks" and more complex parts are built from them (and from other composite parts) according to certain rules. All these primitive and composite parts are usually called symbols. Atomic symbols are called terminals, and composite ones are called nonterminals. This approach turned out to work well for artificial languages with very simple grammars (compared to natural ones). Formally, a grammar consists of

• A finite set of terminal symbols.
• A finite set of nonterminal symbols.
• A finite set of production rules, which contain the information about the language structure.
• A start symbol, a nonterminal that will match any correctly constructed language statement. It is the starting point for the analysis of any statement.

The class of grammars we are interested in has a very particular form of production rules. Each rule looks like

<nonterminal> ::= a sequence of terminals and nonterminals

As we see, this is exactly a description of the structure of a composite nonterminal. We can write several alternative rules for the same nonterminal, and the appropriate one will be applied. To make this less verbose, we will use a notation with the | character to denote "or", just as in regular expressions.
This way of describing grammar rules is called BNF (Backus–Naur form): terminals are denoted by strings enclosed in quotes, production rules are written using the ::= characters, and nonterminal names are written inside angle brackets. Sometimes it is also convenient to introduce a terminal ϵ which, during parsing, is matched against an empty (sub)string. Thus, grammars are a way of describing the structure of a language. They let you perform the following kinds of tasks:

• Test the syntactic correctness of a language statement.
• Generate syntactically correct language statements.
• Parse language statements into hierarchical structures where, for example, an if condition is separated from the surrounding code and represented in a tree-like structure ready to be evaluated.



12.2.1 Example: Natural Numbers

The language of natural numbers can be described by a grammar. We take this set of characters as the alphabet: Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. However, we want a more decent representation than all possible strings built from the characters of Σ, because numbers with leading zeros (000124) do not look good. We define several nonterminal symbols: <notzero> for any digit except zero, <digit> for any digit, and <raw> for any sequence of <digit>s. As we know, several alternative rules are possible for a nonterminal. Thus, to define <notzero>, we could write as many rules as there are different options:

<notzero> ::= '1'
<notzero> ::= '2'
<notzero> ::= '3'
<notzero> ::= '4'
<notzero> ::= '5'
<notzero> ::= '6'
<notzero> ::= '7'
<notzero> ::= '8'
<notzero> ::= '9'

However, as this is verbose and not easy to read, we will use a different notation to describe exactly the same rules:

<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

This notation is part of canonical BNF. After adding a zero, we get the rule for the nonterminal <digit>, which encodes any digit:

<digit> ::= '0' | <notzero>

We then define the nonterminal <raw> to encode all sequences of digits. A sequence of digits is defined recursively: it is either a single digit or a digit followed by another sequence of digits:

<raw> ::= <digit> | <digit> <raw>

Finally, <number> will serve as the start symbol. Either we are dealing with a one-digit number, which is not restricted in any way, or we have multiple digits, and then the first one must not be zero (otherwise it is a leading zero, which we do not want to see); the rest can be arbitrary. Listing 12-1 shows the final result.

Listing 12-1. natural_grammar
<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<digit>   ::= '0' | <notzero>
<raw>     ::= <digit> | <digit> <raw>
<number>  ::= <digit> | <notzero> <raw>



12.2.2 Example: Simple Arithmetic

Let's add some simple binary operations. To begin with, we limit ourselves to addition and subtraction. We base the grammar on the one shown in Listing 12-1 and add a nonterminal <expr>, which will serve as the new start symbol. An expression is either a number or a number followed by a binary operation sign and another expression (so an expression is also defined recursively). Listing 12-2 shows the result.

Listing 12-2. grammar_nat_pm
<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<digit>   ::= '0' | <notzero>
<raw>     ::= <digit> | <digit> <raw>
<number>  ::= <digit> | <notzero> <raw>
<expr>    ::= <number> | <number> '+' <expr> | <number> '-' <expr>

The grammar allows us to build a tree-like structure over the text, where each leaf is a terminal and every other node is a nonterminal. For example, let's apply the current set of rules to the string 1+42 and see how it decomposes. Figure 12-1 shows the result.

Figure 12-1. Parse tree for the expression 1+42

The first expansion is performed according to the rule <expr> ::= <number> '+' <expr>. The last <expr> is just a <number>, which is a sequence of a <notzero> digit and a <raw>.

12.2.3 Recursive Descent

Writing parsers by hand is not difficult. To illustrate, we will build a parser that applies our new knowledge of grammars by literally translating the grammar description into parsing code. We take the grammar of natural numbers already described in Section 12.2.1 and add just one more rule. The new start symbol will be <str>, which corresponds to "a number terminated by the null terminator." Listing 12-3 shows the revised grammar definition.



Listing 12-3. natural_grammar_nullterm
<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<digit>   ::= '0' | <notzero>
<raw>     ::= <digit> | <digit> <raw>
<number>  ::= <digit> | <notzero> <raw>
<str>     ::= <number> '\0'

When analyzing grammar rules, people usually operate with a notion of a stream. A stream is a sequence of whatever counts as a symbol. Its interface consists of two functions:

• bool expect(symbol) accepts a single terminal and returns true if the stream contains exactly this terminal at the current position.
• bool accept(symbol) does the same and, on success, also advances the stream position by one.

Until now, we have been operating with abstractions such as symbols and streams. We can now map these abstract notions to concrete instances. In our case, a symbol will correspond to a single character.1 Listing 12-4 shows an example parser created directly from the grammar rule definitions. It is a syntax checker, which verifies that the string contains a natural number without leading zeros and nothing else (not even spaces around the number).

Listing 12-4. rec_desc_nat.c
#include <stdio.h>
#include <stdbool.h>

char const* stream = NULL;

bool accept( char c ) {
    if ( *stream == c ) { stream++; return true; }
    else return false;
}

bool notzero( void ) {
    return accept( '1' ) || accept( '2' ) || accept( '3' )
        || accept( '4' ) || accept( '5' ) || accept( '6' )
        || accept( '7' ) || accept( '8' ) || accept( '9' );
}

bool digit( void ) {
    return accept( '0' ) || notzero();
}

1. For programming language parsers, it is much easier to choose keywords and word classes (such as identifiers or literals) as terminal symbols. Breaking them down into individual characters introduces unnecessary complexity.




bool raw( void ) {
    if ( digit() ) { raw(); return true; }
    return false;
}

bool number( void ) {
    if ( notzero() ) { raw(); return true; }
    else return accept( '0' );
}

bool str( void ) { return number() && accept( 0 ); }

void check( const char* string ) {
    stream = string;
    printf( "%s -> %d\n", string, str() );
}

int main( void ) {
    check( "12345" );
    check( "hello12" );
    check( "0002" );
    check( "10dbd" );
    check( "0" );
    return 0;
}

This example shows how each nonterminal maps to a function of the same name that tries to apply the relevant grammar rules. The analysis happens from top to bottom: we start with the most general start symbol and try to break the input down into parts and analyze them. When several rules for the same nonterminal start identically, we factor out the common part by consuming it first and then trying to consume the rest, as in the number function. Its two branches start with overlapping nonterminals: <digit> and <notzero>. Each of them covers the range 1…9; the only difference is that <digit> also includes zero. So, if we find a terminal in the range 1–9, we try to consume as many digits as possible, and we will succeed either way. Otherwise, we check whether the first digit is 0 and stop if so, without consuming further terminals. The notzero function succeeds if one of the symbols in the range 1 to 9 is found. Due to the short-circuit evaluation of ||, not all accept calls will be performed: the first one that succeeds finishes the evaluation of the whole expression, so the stream is advanced by exactly one position. The digit function succeeds if a zero is found or if notzero succeeds, which is a literal translation of the rule <digit> ::= '0' | <notzero>. The other functions are written in the same way. If we were not limiting the input with a null terminator, the parser would answer the question: "Does this sequence of symbols start with a valid language sentence?" In Listing 12-4, we purposely use a global variable for ease of understanding. We still strongly discourage its use in real programs.



Real programming language parsers are often quite complex. To write them, programmers use special tools that can generate parsers from a declarative description close to BNF. If you need to write a parser for a complex language, we recommend taking a look at the ANTLR or yacc parser generators. Another popular technique for handwritten parsers is called parser combinators. It encourages creating parsers for the most basic generic text elements (a single character, a number, a variable name, etc.). These small parsers are then combined (OR, AND, sequencing, …) and transformed (one or more occurrences, zero or more occurrences, …) to produce more complex parsers. This technique, however, is easiest to apply when the language supports a functional programming style, because it relies heavily on higher-order functions.

■■About recursion in grammars Grammar rules can be recursive, as you have seen. However, depending on the parsing technique, certain kinds of recursion may be inadvisable. For example, a rule expr ::= expr '+' expr, although valid, will not let us build a parser easily. To write a grammar that is good in this sense, you should avoid left-recursive rules like this one because, naively coded, they produce infinite recursion: the expr() function starts its execution with another call to expr(). Rules in which the recursive nonterminal is not the first symbol on the right side of the production avoid this problem.

■■Question 240  Write a recursive descent parser for floating-point arithmetic with multiplication, subtraction, and addition. For this task, we assume that there are no negative literals (so instead of writing -1.20, we write 0-1.20).

12.2.4 Example: Arithmetic with Priorities

The interesting part about expressions is that different operations have different priorities. For example, the addition operation has a lower priority than the multiplication operation, so all multiplications are performed before additions. Let's look at the naive grammar of integers with addition and multiplication in Listing 12-5.

Listing 12-5. grammar_nat_pm_mult
<notzero> ::= '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<digit>   ::= '0' | <notzero>
<raw>     ::= <digit> | <digit> <raw>
<number>  ::= <digit> | <notzero> <raw>
<expr>    ::= <number> | <number> '+' <expr> | <number> '*' <expr>




Despite the precedence of multiplication, the parse tree for the expression 1*2+3 will look as shown in Figure 12-2.

Figure 12-2. Parse tree without priorities for the expression 1*2+3

However, as we noted, multiplication and addition are treated the same here: they are expanded in order of appearance. Because of this, the expression 1*2+3 is parsed as 1*(2+3), breaking the usual evaluation order implied by the tree structure. From the parser's point of view, priority means that in the parse tree the "addition" nodes must be closer to the root than the "multiplication" nodes, since addition is performed on the larger parts of the expression. The evaluation of arithmetic expressions is done, informally, starting from the leaves and ending at the root. How do we prioritize some operations over others? This is achieved by splitting one syntactic category into several classes, where each class is a refinement of the previous one. Listing 12-6 shows an example.

Listing 12-6. grammar_priorities
<expr0> ::= <expr1> | <expr1> '<' <expr0> | <expr1> '==' <expr0>
<expr1> ::= <expr2> | <expr2> '+' <expr1> | <expr2> '-' <expr1>
<expr2> ::= <number> | <number> '*' <expr2>

We can read this example as follows:
• <expr0> is really any expression.
• <expr1> is an expression without <, == and the other terminals present in the first rule.
• <expr2> is additionally free of addition and subtraction.



12.2.5 Example: Simple Imperative Language

To illustrate that this knowledge can be applied to programming languages, we give an example of the syntax of one. This syntax description provides definitions for statements, which comprise typical imperative constructs: if, while, print, and assignments. Keywords can be treated as atomic terminals. Listing 12-7 shows the grammar.

Listing 12-7. imp
<statement>  ::= <print> | <assignment> | <if> | <while>
<print>      ::= "print" "(" <expr> ")"
<assignment> ::= IDENT "=" <expr>
<if>         ::= "if" "(" <expr> ")" <statement> "else" <statement>
<while>      ::= "while" "(" <expr> ")" <statement>
<expr>       ::= IDENT | NUMBER | <expr> "+" <expr> | <expr> "==" <expr> | "(" <expr> ")"

12.2.6 Chomsky's Hierarchy

Formal grammars as we have studied them are actually a subclass of formal grammars as Chomsky saw them. This class is called context-free grammars, for reasons that will soon become apparent. The hierarchy consists of four levels numbered from 3 down to 0, with the lower levels being more expressive and powerful.

3. Regular grammars are equivalently described by our old friends, regular expressions. Their parsers, finite automata, are the weakest kind of parsers, because they cannot handle nested, self-similar structures such as arithmetic expressions. Even in the simplest case, <expr> ::= <number> '+' <expr>, the part of the expression on the right side of '+' is similar to the whole expression, and this rule can be applied recursively an arbitrary number of times.

2. The context-free grammars we have already studied have rules of the form
<nonterminal> ::= a sequence of terminals and nonterminals
Any regular expression can also be described in terms of a context-free grammar.

1. Context-sensitive grammars have rules of the form
a A b ::= a γ b
where a and b denote arbitrary (possibly empty) sequences of terminals and/or nonterminals, γ denotes a nonempty sequence of terminals and/or nonterminals, and A is the nonterminal being expanded.



The difference between levels 2 and 1 is that the nonterminal A on the left is replaced by γ only when it occurs between a and b (which remain untouched). Remember that both a and b can be quite complex.

0. Unrestricted grammars have rules of the form
a sequence of terminals and nonterminals ::= a sequence of terminals and nonterminals
Since there are absolutely no restrictions on the left and right sides of the rules, these grammars are very powerful. It can be shown that grammars of this kind can encode any computer program, so they are Turing-complete.

Real programming languages are almost never truly context-free. For example, using a previously declared variable is apparently a context-sensitive construction, because it is only valid when it follows a corresponding variable declaration. However, for simplicity, languages are usually approximated with context-free grammars, and then additional passes over the parse tree check that these context-sensitive conditions are met.

12.2.7 Abstract Syntax Tree

There is a notion of abstract syntax. It describes the trees that are built from the source code. Concrete syntax describes the exact mapping between keywords and the kinds of tree nodes they map to. For example, suppose we rewrite the C compiler so that the while keyword is replaced with _while_, and then rewrite all programs so that this new keyword is used instead of while. The concrete syntax has changed, but the abstract syntax is the same, because the language constructs have remained the same. On the other hand, if we add a new clause to if, a statement that will be executed regardless of the value of the condition, we change the abstract syntax as well. The abstract syntax tree is generally much more minimalistic than the parse tree, which contains information relevant only to parsing (see Figure 12-3).

Figure 12-3. Parse tree and abstract syntax tree of the expression 1+2*3

As we can see, the tree on the right is much more concise and straightforward. This tree can be directly evaluated by an interpreter, or executable code that performs the computation can be generated from it.



12.2.8 Lexical Analysis

Applying grammar rules directly to individual characters is actually overkill. It is often convenient to add a preliminary step called lexical analysis. The raw text is first transformed into a sequence of lexemes (also called tokens). Each kind of token is described by a regular expression and extracted from the character stream. For example, a number can be described by the regular expression [0-9]+ and an identifier by [a-zA-Z_][0-9a-zA-Z_]*. After this processing, the text is no longer a flat string of characters but a linked list of tokens. Each token is tagged with its type, and for the parser, token types become the terminal symbols. It is easy to discard all the formatting details (such as line breaks and other whitespace symbols) during this step.

12.2.9 Summary on Parsing

The compiler analyzes the source code in several steps. Two important steps are lexical analysis and syntactic analysis (parsing). During lexical analysis, the program text is broken down into lexemes, such as integer literals or keywords. The text formatting is no longer relevant after this step. Each kind of lexeme is best described using a regular expression. During parsing, a tree structure is built over the token stream. This structure is called an abstract syntax tree; each node corresponds to a language construct.

12.3 Semantics

The semantics of a language is a correspondence between sentences as syntactic constructions and their meaning. Each sentence is usually described as a kind of node in the program's abstract syntax tree. This description is done in one of the following ways:

• Axiomatically. The current program state is described by a set of logical formulas. Then each step of the abstract machine transforms these formulas in a certain way.
• Denotationally. Each sentence of the language is mapped to a mathematical object of a certain theory (for example, domain theory). The effects of the program can then be described in terms of this theory. This is of particular interest when reasoning about the behavior of different programs written in different languages.
• Operationally. Each sentence produces a certain change of state in the abstract machine, which is subject to description.

The descriptions in the C standard are informal, but they resemble an operational semantic description more than the other two. The language standard is a description of the language in a human-readable format. While it is more accessible to the unprepared reader, it is more verbose and at times more ambiguous. To write concise descriptions, the language of mathematical logic and lambda calculus is often used. We will not go into detail in this book, because this topic deserves a pedantic approach in its own right. We refer you to the books [29] and [35] for an immaculate study of type theory and language semantics.



12.3.1 Undefined Behavior

The semantic description does not have to be total. This means that some language constructs are defined only for a subset of all possible situations. For example, dereferencing a pointer *x is guaranteed consistent behavior only when x points to a "valid" memory location. When x is NULL or points to deallocated memory, undefined behavior occurs. Nevertheless, such an expression is absolutely correct syntactically. The standard introduces cases of undefined behavior intentionally. Why? First, it is easier to write compilers that produce code with fewer guarantees. Second, all defined behavior must be implemented. If we wanted dereferencing a null pointer to cause an error, the compiler would have to do two things each time a pointer is dereferenced:

• Try to deduce that at this exact location the pointer can never be NULL.
• If the compiler cannot deduce that the pointer is never NULL, it emits assembly code that performs the check. If the pointer is NULL, this code runs a handler; otherwise, the pointer is dereferenced.

Listing 12-8 shows an example.

Listing 12-8. ptr_analysis1.c
int x = 0;
int* p = &x;
... /* no writes to `p` on these lines */ ...
*p = 10; /* this pointer cannot be NULL */

However, this is much more complicated than it sounds. In the example in Listing 12-8, we might assume that since there are no writes to the variable p, it always contains the address of x. However, this is not always true, as the example shown in Listing 12-9 illustrates.

Listing 12-9. ptr_analysis2.c
int x = 0;
int* p = &x;
... /* no writes to `p` on these lines */
int** z = &p;
*z = NULL; /* still not a direct write to `p` */
...
*p = 10; /* "this pointer cannot be NULL" -- no longer true */

So solving this problem actually requires a very complex analysis in the presence of pointer arithmetic.
Once a variable's address is taken, or, even worse, passed to a function, you have to analyze the entire sequence of function calls, taking function pointers, pointers to pointers, and so on into account. The analysis will not always return correct results (in the most general case this problem is even theoretically undecidable), and performance can suffer from it. So, in the spirit of the classical laissez-faire approach, checking pointer dereferences is left to the programmer.



In managed languages such as Java or C#, defined behavior for pointer dereferencing is much easier to achieve. First, they usually run inside a framework that provides code to raise and handle exceptions. Second, null analysis is much simpler in the absence of address arithmetic. Finally, such languages are usually compiled just-in-time, which means that the compiler has access to runtime information and can use it to perform some optimizations that are unavailable to an ahead-of-time compiler. For example, after the program has started and received user input, the compiler may infer that a pointer x is never NULL if some condition P holds. It can then generate two versions of the function f that contains the dereference: one with the check and one without it. Each time f is called, only one of the two versions is invoked: if the compiler can prove that P holds at the call site, the unchecked version is called; otherwise, the checked one is. Undefined behavior can be dangerous (and it often is). It leads to subtle bugs, because it does not guarantee a compile-time or runtime error. The program may run into a situation with undefined behavior and silently continue execution; however, its behavior can change bizarrely after a certain number of statements have been executed. A typical situation is heap corruption. The heap is structured; each block is delimited by utility information used by the standard library. Writing outside the block boundaries (but close to them) is likely to corrupt this information, causing a failure during one of the future calls to malloc or free, making such a bug a time bomb. These are some cases of undefined behavior explicitly enumerated by the C99 standard. We do not provide the complete list, as it contains at least 190 cases.

• Signed integer overflow.
• Dereferencing an invalid pointer.
• Comparing pointers to elements of two different memory blocks.
• Calling a function with arguments that do not match its actual signature (possible by taking a function pointer and casting it to a different function type).
• Reading an uninitialized local variable.
• Division by 0.
• Accessing an array element outside its bounds.
• Attempting to modify a string literal.
• Reading the return value of a function that exited without executing a return statement.

12.3.2 Unspecified Behavior It is important to distinguish between undefined behavior and unspecified behavior. Unspecified behavior defines a set of behaviors that can occur but does not specify exactly which one will be selected; the selection depends on the compiler. For example,

• The function argument evaluation order is unspecified. This means that when evaluating f(g(), h()) we have no guarantee that g() will be evaluated before h(). However, both g() and h() are guaranteed to be evaluated before f().
• The order of evaluation of subexpressions is unspecified in general: f(x) + g(x) does not force f to be evaluated before g.

Unspecified behavior describes cases of nondeterminism in the abstract C machine.


12.3.3 Implementation-Defined Behavior The standard also defines implementation-defined behavior, such as the size of int (which, as we said, depends on the architecture). We can think of such choices as parameters of the abstract machine: before it starts, we have to fix these parameters. Another example of this kind of behavior is the modulo operation x % y: the result for negative operands is implementation-defined. What is the difference between implementation-defined and unspecified behavior? The answer is that the implementation (compiler) has to explicitly document the choices it makes, whereas in cases of unspecified behavior anything from a range of possible behaviors can occur.

12.3.4 Sequence Points Sequence points are the locations in the program where the state of the abstract machine is consistent with the state of the target machine. We can think of them this way: when debugging a program, we can execute it one step at a time, each step roughly equivalent to one C statement. We usually stop at semicolons, function calls, the || operator, etc. However, we can switch to the assembly view, where each statement is possibly encoded by many instructions, and step through these instructions in the same way. This allows us to execute only part of a statement, stopping halfway through. At such a moment, the state of the abstract C machine is not well defined. After we finish executing the instructions that implement a single statement, the machines' states become "synchronized," allowing us to explore not only the assembly-level state but also the state of the C program itself.

The second, equivalent definition of a sequence point is a location in the program where the side effects of previous expressions have already been applied, but the side effects of the following expressions have not yet been. Sequence points are

• The semicolon.
• The comma (which in C can act similarly to a semicolon but also chains expressions; its use for this purpose is discouraged).
• Logical AND/OR (not the bitwise versions!).
• The point when the function's arguments have been evaluated but the function has not yet started executing.
• The question mark in the ternary operator.

Multiple real-world instances of undefined behavior are linked to the notion of sequence points. Listing 12-10 shows an example.

Listing 12-10. seq_points.c
int i = 0;
i = i++ * 10;

What will i be equal to? Unfortunately, the best answer we can give is this: this code exhibits undefined behavior. We simply do not know whether i will be incremented before i * 10 is assigned to i or after. There are two writes to the same memory location before the next sequence point, and the order in which they occur is not defined.
The reason for this is that, as we saw in Section 12.3.2, the order of evaluation of subexpressions is not fixed. Because subexpressions can have effects on the memory state (think function calls, or the pre- and post-increment operators), and there is no mandated order in which these effects occur, even the result of one subexpression can depend on the effects of another.


12.4 Pragmatics 12.4.1 Alignment From the point of view of the abstract machine, we are dealing with bytes of memory; each byte has its own address. However, the hardware protocols used on the chip are quite different. It is quite common for the processor to be able to read only packs of, say, 16 bytes, starting from an address divisible by 16. In other words, it can read the first or the second 16-byte chunk of memory, but not a chunk that starts at an arbitrary address.

We say data is aligned on an N-byte boundary if it starts at an address divisible by N. Obviously, if data is aligned on a kn-byte boundary, it is automatically aligned on an n-byte boundary. For example, if a variable is aligned on a 16-byte boundary, it is simultaneously aligned on an 8-byte boundary.

Aligned data (8-byte boundary):
0x...00: 11 22 33 44 55 66 77 88

Unaligned data (8-byte boundary):
0x...00: .. .. .. 11 22 33 44 55
0x...08: 66 77 88 .. .. .. .. ..

What happens when the programmer requests a read of a multibyte value that spans two such blocks (e.g., an 8-byte value whose first bytes are in one block and the rest in another)? Different architectures provide different answers to this question.

Some hardware architectures prohibit unaligned memory access. An attempt to read any value that is not aligned on, for example, an 8-byte boundary results in an interrupt. An example of such an architecture is SPARC. Operating systems can emulate unaligned accesses by intercepting the generated interrupt and placing complex access logic in the handler. Such operations, as you can imagine, are extremely expensive, because interrupt handling is relatively slow.

Intel 64 takes a more relaxed approach: unaligned accesses are allowed, but they carry an overhead.
For example, if we want to read 8 bytes starting at address 6 and can only read aligned 8-byte blocks, the CPU (central processing unit) will perform two reads instead of one and then compose the requested value from parts of two quad words. So aligned accesses are cheaper, since they require fewer reads. Memory consumption is usually less of a concern to a programmer than performance; therefore, compilers automatically adjust the alignment of variables in memory, even if doing so creates gaps of unused bytes. This is commonly known as data structure padding. Alignment is a parameter of code generation and program execution, so it is often seen as part of the pragmatics of the language.

12.4.2 Data Structure Padding For structures, alignment exists in two different senses:

• The alignment of the structure instance itself. It affects the address at which the structure starts.
• The alignment of the structure's fields. The compiler can intentionally introduce gaps between structure fields to make accesses to them faster. Data structure padding refers to these gaps.


For example, suppose we create the structure shown in Listing 12-11.

Listing 12-11. align_str_ex1
struct mystr {
    uint16_t a;
    uint64_t b;
};

Assuming alignment on an 8-byte boundary, the size of this structure, returned by sizeof, will be 16 bytes. Field a starts at an address divisible by 8, and six bytes are wasted to align b on an 8-byte boundary. There are several cases in which we need to be aware of this:

• You might want to shift the balance between memory consumption and performance toward lower memory consumption. Imagine you are creating a million instances of a structure and each instance wastes 30% of its size because of alignment gaps. Forcing the compiler to close these gaps will decrease memory usage by 30%, which is nothing to sneeze at. It can also improve locality, whose benefits can far outweigh those of aligning individual fields.
• Reading file headers or accepting network data into structures must take the possible gaps between structure fields into account. For example, suppose a file header contains a 2-byte field followed by an 8-byte field, with no gap between them. Now we try to read this header into a structure, as shown in Listing 12-12.

Listing 12-12. align_str_read.c
struct str {
    uint16_t a;
    /* a 6-byte gap */
    uint64_t b;
};
struct str mystr;
fread( &mystr, sizeof( mystr ), 1, f );

The problem is that the structure layout has a hole inside it, while the file stores the fields contiguously. Assuming the values in the file are a = 0x1111 and b = 0x2222222222222222, Figure 12-4 shows the state of the memory after the read.

Figure 12-4. The structure's memory layout and the data read from the file

There are ways to control the alignment; until C11 they were compiler-specific. Let us study them first. The #pragma keyword allows us to issue one of the pragmatic commands to the compiler. It is supported by MSVC, Microsoft's C compiler, and is also understood by GCC for compatibility reasons.


Listing 12-13 shows how to change the alignment strategy locally using the pack pragma.

Listing 12-13. pragma_pack.c
#pragma pack(push, 2)
struct mystr {
    short a;
    long  b;
};
#pragma pack(pop)

The second argument to pack is the assumed size of the chunk the machine can read from memory at the hardware level. The first argument to pack is push or pop. During translation, the compiler keeps track of the current padding value by checking the top of a special internal stack. We can temporarily replace the current padding value by pushing a new value onto this stack and restore the old value when we are done.

It is possible to change the padding value globally using the following form of this pragma:

#pragma pack(2)

However, this is very dangerous, because it leads to subtle and unpredictable changes in other parts of the program, which are very hard to track down. Let us see how the padding value affects the alignment of individual fields by studying the example shown in Listing 12-14.

Listing 12-14. pack_2.c
#pragma pack(push, 2)
struct mystr {
    uint16_t a;
    int64_t  b;
};
#pragma pack(pop)

The padding value tells us how many bytes a hypothetical target computer can fetch from memory in one read. The compiler tries to minimize the number of reads for each field. There is no reason to leave a gap between a and b here, because there would be no benefit given the padding value. Assuming a = 0x1111 and b = 0x2222222222222222, the memory layout will be as follows:

11 11 22 22 22 22 22 22 22 22

Listing 12-15 shows another example, with a padding value of 4.

Listing 12-15. pack_4.c
#pragma pack(push, 4)
struct mystr {
    uint16_t a;
    int64_t  b;
};
#pragma pack(pop)


What if we used the same gapless memory layout with a padding value of 4? Since we can only read 4 bytes at a time, it would not be optimal. Below we delimit the boundaries of the memory chunks that can be read atomically.

Pack: 2
11 11 | 22 22 | 22 22 | 22 22 | 22 22

Pack: 4, the same gapless memory layout
11 11 22 22 | 22 22 22 22 | 22 22 .. ..

Pack: 4, the actual memory layout
11 11 .. .. | 22 22 22 22 | 22 22 22 22

As we can see, with the padding value set to 4, a gapless memory layout would force the CPU to perform three reads to access b, whereas the actual layout needs only two. So, in essence, the idea is to minimize the number of reads while placing the structure members as close together as possible.

The GCC-specific way of doing more or less the same thing is the packed specifier of the __attribute__ directive. In general, __attribute__ attaches an additional specification to a code entity, such as a type or a function. The packed keyword means that the structure fields are stored consecutively in memory, without any gaps. Listing 12-16 shows an example.

Listing 12-16. str_attribute_packed.c
struct __attribute__((packed)) mystr {
    uint8_t first;
    float   delta;
    float   position;
};

Bear in mind that packed structures are not part of the language, and unaligned accesses are not even supported by some architectures (such as SPARC) at the hardware level, which can mean not only a performance hit but also program crashes or reads of invalid values.

12.5 Alignment in C11 C11 introduced a standardized way to control alignment. It consists of

• Two keywords:
–– _Alignas
–– _Alignof
• The header file stdalign.h, which defines the preprocessor aliases alignas and alignof for _Alignas and _Alignof.
• The function aligned_alloc.

Alignment is only possible to powers of 2: 1, 2, 4, 8, and so on. alignof is used to query the alignment of a given variable or type. It is computed at compile time, just like sizeof. Listing 12-17 shows an example of its usage. Note the "%zu" format specifier, which is used to print or scan values of type size_t.


Listing 12-17. alignof_ex.c
#include <stdio.h>
#include <stdalign.h>
int main(void) {
    short x;
    printf("%zu\n", alignof(x));
    return 0;
}

In fact, alignof(x) returns the greatest power of two on which x is aligned, since aligning anything on, say, an 8-byte boundary implies aligning it on 4-, 2-, and 1-byte boundaries as well (all of its divisors). Prefer alignof to _Alignof and alignas to _Alignas.

alignas accepts a constant expression and is used to force an alignment on a given variable or array. Listing 12-18 shows an example. When run, it outputs 8.

Listing 12-18. alignas_ex.c
#include <stdio.h>
#include <stdalign.h>
int main( void ) {
    alignas( 8 ) short x;
    printf("%zu\n", alignof(x));
    return 0;
}

By combining alignof and alignas, we can align variables on the same boundary as other variables. You cannot align variables to a value less than their size, and alignas cannot be used to produce the same effect as __attribute__((packed)).

12.6 Summary In this chapter we structured and expanded our knowledge of what a programming language is. We looked at the basics of parser writing and studied the notions of undefined and unspecified behavior and why they are important. Then we introduced the notion of pragmatics and elaborated on one of its most important instances: data alignment and structure padding. We have deferred one topic from this chapter to the next, where we elaborate on the most important good coding practices. Assuming our readers are not yet very fluent in C, we want them to adopt good habits as early as possible in the course of their C journey.

■■Question 241  What is the language syntax?
■■Question 242  What are grammars for?
■■Question 243  What is a grammar?
■■Question 244  What is BNF?
■■Question 245  How do we write a recursive descent parser given a grammar in BNF?


■■Question 246  How do we incorporate priorities into the grammar description?
■■Question 247  What are the levels of the Chomsky hierarchy?
■■Question 248  Why are regular languages less expressive than context-free grammars?
■■Question 249  What is lexical analysis?
■■Question 250  What is the semantics of a language?
■■Question 251  What is undefined behavior?
■■Question 252  What is unspecified behavior, and how is it different from undefined behavior?
■■Question 253  What are the cases of undefined behavior in C?
■■Question 254  What are the cases of unspecified behavior in C?
■■Question 255  What are sequence points?
■■Question 256  What is pragmatics?
■■Question 257  What is data structure padding? Is it portable?
■■Question 258  What is alignment? How can it be controlled in C11?


CHAPTER 13

Good Coding Practices In this chapter, we want to focus on coding style. When writing code, a developer constantly faces a decision-making process. Which data structures should I use? How should they be named? Where and when should they be allocated? Experienced programmers make these decisions differently than beginners do, and we think it is extremely important to talk about this decision-making process.

13.1 Decision-making Decisions often require a balance between two mutually exclusive poles. The classic example is that you cannot ship a quality product both cheaply and quickly. Finely tuned code is often hard to read and debug. Therefore, some code features should be prioritized over others, based on common sense and the task itself. Because of that, code guidelines such as these are a good start, but following them blindly is not the way to go. Our advice for writing code is based on the following assumptions:

1. We want code to be as reusable as possible. This often requires careful planning and coordination between developers, which does not let you write code very quickly, but it pays off soon, because it saves debugging time and really lets you write complex software. Debugging programs is generally considered harder than writing them. Therefore, less code usually means less time spent debugging and more robust features. This is especially important for languages like C, which

• Are unsafe in the broadest sense (allow pointer arithmetic, do not perform bounds checks, etc.).
• Lack an expressive type system, such as those seen in languages like Scala, Haskell, or OCaml. Such type systems impose a number of restrictions on the program which must be satisfied; otherwise, the compiler rejects it.

This rule has one notable exception: when reusing functions results in a drastic drop in performance, for example, when it gives the algorithm an unnecessarily large O-complexity. For instance, we did an assignment on linked lists in Chapter 10. There was a function to compute the sum of all integers in a given list. One way to write it is shown in Listing 13-1.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_13


Chapter 13 ■ Good Coding Practices

Listing 13-1. list_sum_bad.c
int list_sum( const struct list* l ) {
    size_t i;
    int sum = 0;
    /* We do not want to launch the costly size computation
     * on each loop iteration */
    size_t sz = list_size( l );
    for( i = 0; i < sz; i++ )
        sum = sum + list_at( l, i );
    return sum;
}

In this example, for each i in the range from 0 to the length of the singly linked list, we actually start traversing the list from its first element. This results in a drastic performance drop compared to a single pass over the list, accumulating the sum along the way. In the latter case, adding one more element to the list results in one additional list access, whereas in the program shown in Listing 13-1 it results in list_size(l) additional accesses.

2. The program should be easy to modify. This point is interdependent with the previous one. Smaller functions are usually more reusable, and modifications also become easier, because more of the code from the previous version can be left intact.

3. Code should be as easy to read as possible. The key factors here are
• Sensible naming. Even if you are not a native English speaker, do not write variable names, function names, or comments in your native language.
• Consistency. Use the same naming conventions and consistent ways of performing similar operations.
• Short and concise functions. If a function's logic is too verbose, it is usually a sign of a lack of decomposition or of a missing abstraction layer. This also has a good effect on maintenance.

4. Code should be easy to test. Testing assures us that, at least in some elaborated cases, the code behaves as intended.

Sometimes the task requires the opposite. For example, if we are writing code for a microcontroller, in the absence of a good optimizing compiler and with severely limited resources, we might be forced to abandon a beautiful code structure because the compiler cannot perform inlining properly; then each call will affect performance, often unacceptably.

13.2 Code Elements 13.2.1 General Naming The specific naming convention is usually dictated by the language itself. In cases where the project is based on an existing code base, it may be reasonable not to deviate from it for the sake of consistency. In this book, we are using the following naming conventions: • All names are written in lower case. • Parts of the name are separated by an underscore, as follows: list_count.


The rest of this section focuses on the different features of the language and the associated naming and usage conventions.

13.2.2 File Structure Include files must have include guards. They should be self-sufficient, which means that for every header file thisfile.h, a .c file consisting of the single line #include "thisfile.h" must compile. The order of inclusion is usually chosen as follows:

• The related header.
• C standard library headers.
• Other libraries' .h files.
• Your project's .h files.

Then stick to a consistent order of declaring macros, types, functions, variables, etc. This greatly simplifies navigating the project. A typical order is

• For headers:
1. Include files.
2. Macros.
3. Types.
4. Variables (global).
5. Functions.
• For .c files:
1. Include files.
2. Macros.
3. Types.
4. Variables (global).
5. Static variables.
6. Functions.
7. Static functions.

13.2.3 Types • When possible (C99 or later), prefer types defined in stdint.h, such as uint64_t or uint8_t. • If you want to be POSIX compliant, do not define your own types with the _t suffix. It is reserved for standard types, so new types introduced in future revisions of the standard will not conflict with custom types defined in some programs. • Types are usually named with a prefix common to the project. For example, if you wanted to write a calculator, type tags would be prefixed with calc_.


• When defining structures, if you are free to choose the order of the fields, define them in the following order:
–– First, try to minimize the memory lost to data structure padding.
–– Then sort the fields by size.
–– Finally, sort them alphabetically.
–– Sometimes structures have fields that should not be modified directly by the user. For example, suppose a library defines the structure shown in Listing 13-2.

Listing 13-2. struct_private_ex.c
struct mypair {
    int x;
    int y;
    int _refcount;
};

Fields in such a structure can be modified directly using the dot or arrow syntax. Our convention, however, implies that only library functions should modify the _refcount field; the library user should never do it manually. C lacks a concept of private structure fields, so this is as close as we can get to them without more or less dirty tricks.
–– Members of an enumeration are written in uppercase, like constants. A common prefix is suggested for the members of one enumeration. An example is shown in Listing 13-3.

Listing 13-3. enum_ex.c
enum exit_code {
    EX_SUCCESS,
    EX_FAILURE,
    EX_INVALID_ARGUMENTS
};

13.2.4 Variables Choosing the right names for variables and functions is crucial.

• Use nouns for variable names.
• Boolean variables should also have meaningful names. It is advisable to prefix them with is_, then add the exact property being checked. is_good is probably too broad to be a good name in most cases, as opposed to is_prime or is_before_last. Prefer positive names to negative ones, because they are parsed more easily by the human brain; for example, is_even over is_not_odd.
• It is not recommended to use meaningless names like a, b, or x4. The notable exception is code illustrating a paper or other document that describes an algorithm in pseudocode using such names. In that case, changing the names is more likely to confuse readers than to provide clarity. Indices are traditionally called i and j, and you will be understood if you stick to them.


• Including the unit of measurement can be a good idea, for example, uint32_t delay_msecs.
• Other suffixes are also useful, such as cnt, max, etc. For example, try_max (maximum attempts allowed), try_cnt (attempts made).
• Global constants are named in uppercase. Global mutable variables are prefixed with g_.
• Tradition dictates defining global constants using the #define directive. However, the modern approach is to use static const variables or just const globals. Unlike #defines, they are typed, and they also look better when debugging. A quality compiler will inline them anyway (if it decides that will be faster).
• Use the const modifier whenever appropriate. C99 lets you declare variables at arbitrary locations within a function, not only at the beginning of a block. Use this to store intermediate results in named constants.
• Do not define global variables in header files! Define them in .c files and declare them in the .h file as extern.

13.2.5 On Global Variables Do not use global mutable variables if you can avoid it. We cannot stress this enough. These are the most important problems they bring:

• In medium-sized projects, and even more so in large projects with many lines of code, all information about a function's effects is best located in its signature. A function f can call another function g, and so on, and somewhere down this chain a global variable gets changed. We cannot see that this change may occur by looking at f alone; we have to study all the functions it calls, the functions they call, and so on.
• They make functions non-reentrant. This means that a function f cannot be called while it is already running. The latter can happen in two cases:
–– Function f calls other functions, which after some inner calls call f again, while the first instance of f has not finished yet. Listing 13-4 shows an example of a function f that cannot be reentered.

Listing 13-4. reenterability.c
bool flag = true;
int var = 0;

void f(void);

void g(void) {
    f();
    flag = false;
}

void f(void) {
    if (flag) g();
}

–– The program is parallelized, and the function is used by multiple threads (which is often the case on modern computers).


In the case of a complex call hierarchy, finding out whether a function is reentrant requires additional analysis.

• They introduce security risks, because their values often need to be checked before being modified or used, and programmers tend to forget these checks. If something can go wrong, it will.
• They make functions harder to test, because the functions depend on global data being in a particular state. Writing code that is hard to test is always a practice to avoid.

Global static mutable variables are also bad, but at least they do not pollute the global namespace of other files. Global static immutable variables (static const), however, are perfectly fine and can often be inlined by the compiler.

13.2.6 Functions
• Use verbs to name functions, for example, package_checksum_calc.
• The is_ prefix is also quite common for functions that check conditions, for example, int is_prime( long num ).
• Functions that operate on a structure with a given tag are usually prefixed with the name of that tag, for example, bool list_is_empty( struct list* lst );. Since C does not allow precise namespace control, this seems the easiest way to cope with the chaos arising when most functions are accessible from everywhere.
• Use the static modifier for all functions except those you want to be visible to everyone.
• Probably the most important place to use const is for function arguments of type "pointer to immutable data." It ensures that the function does not accidentally change them due to a programmer error.

13.3 Files and Documentation As the project grows, the number of files increases, and it becomes harder to navigate through them. To be able to handle large projects, you must structure them from the start. Below is a common template for the project root directory.

src/        Source files.

doc/        Documentation.

res/        Resource files (such as images).

lib/        Static libraries to be linked into the executable file.

build/      The build artifacts: an executable file and other generated files.

include/    Include files. This directory is added to the compiler's include search path using the -I flag.

obj/        Generated object files. The linker assembles them into executables and libraries; they are not needed after the build is complete.

configure   The initial configuration script, which should be launched before compilation. It can configure different target architectures or enable and disable features.

Makefile    Contains instructions for the automated build system. The file name and format vary depending on the system used.


There are many build systems; some of the most popular ones for C are make, cmake, and automake. Different languages have different ecosystems and often have dedicated build tools (e.g., Gradle or OCamlBuild). We recommend that you study the following projects, which, as far as we know, are well organized:

• www.gnu.org/software/gsl/
• www.gnu.org/software/gsl/design/gsl-design.html
• www.kylheku.com/kaz/kazlib.html

Doxygen is a de facto standard for creating documentation for C and C++ programs. It allows us to generate a fully structured set of HTML or LATEX pages from the program's source code. Function and variable descriptions are taken from specially formatted comments. Listing 13-5 shows an example of a source file accepted by Doxygen.

Listing 13-5. doxygen_example.h
#pragma once
#include
#include

/** @defgroup const_pool Constant pool
 * @{ */

/** Free the memory allocated for the pool contents */
void const_pool_deinit( struct vm_const_pool* pool );

/** Non-destructive constant pool combination
 * @param a First pool.
 * @param b Second pool.
 * @returns An initialized constant pool combining the contents of both arguments.
 */
struct vm_const_pool const_combine( struct vm_const_pool const* a, struct vm_const_pool const* b );

/** Modify a pool by appending the contents of the other pool to its end.
 * @param[out] src The source pool to modify.
 * @param fresh The pool to merge into `src`. */
void const_merge( struct vm_const_pool* src, struct vm_const_pool const* fresh );

/** @} */

Doxygen processes specially formatted comments (starting with /** and containing commands such as @defgroup) to generate documentation for the respective code entities. For more information, refer to the Doxygen documentation.


13.4 Encapsulation One of the foundations of thinking is abstraction. In software engineering, it is the process of hiding implementation details and data. If we want to implement a certain behavior, such as image rotation, we would like to think about image rotation only. The format of the input file and the format of its headers are of little importance to us; the really important things are how to work with the points that form the image and how to know its dimensions. However, you cannot write a program without taking all this information into account, even though it is actually independent of the rotation algorithm itself.

So we divide the program into parts, each part serving its purpose and its purpose alone. The logic of a part can be used by calling a set of exposed functions and/or accessing a set of exposed global variables. Together they form an interface to this part of the program. To implement them, however, we usually have to write more functions, which are better hidden from the end user.

■■Working with version control systems  When working in a team, where many people make changes simultaneously, it is very important to keep functions smaller. If a function performs many actions and its code is huge, it becomes harder to automatically merge multiple independent changes to it.

In programming languages that support packages or classes, these are used to hide pieces of code and create interfaces for them. Unfortunately, C has neither of them; moreover, there is no concept of "private fields" in structures: all fields are visible to everyone. Support for separate code files, called translation units, is the only real language feature that helps us isolate parts of the program code. We use the notion of a module as a synonym for a translation unit, a .c file. The C standard does not define a notion of module; in this book we use the terms interchangeably, because for the C language they are more or less equivalent.

As we know, functions and global variables become public symbols by default and are therefore accessible from other files. The reasonable thing to do is to mark all "private" functions and global variables static in the .c file and declare all "public" functions in the .h file.

As an example, let us write a module implementing a stack. The header file will describe the structure and the functions that can operate on its instances. It resembles object-oriented programming without subtyping (no inheritance). The interface will consist of functions for

• Creating and destroying a stack.
• Pushing elements to and popping them from a stack.
• Checking whether the stack is empty.
• Launching a function for each element of the stack.

The code file will define all these functions, and probably a couple more, which will not be accessible outside the code file and are created solely for the purpose of decomposition and code reuse. Listings 13-6 and 13-7 show the resulting code. stack.h describes the interface. It has an include guard, lists all other headers (standard headers first, then custom ones), and defines custom types.


Chapter 13 ■ Good Coding Practices

Listing 13-6. stack.h

#ifndef _STACK_H_
#define _STACK_H_

#include <stddef.h>
#include <stdbool.h>

struct list;

struct stack {
    struct list* first;
    struct list* last;
    size_t count;
};

struct stack stack_init( void );
void stack_deinit( struct stack* st );
void stack_push( struct stack* s, int value );
int  stack_pop( struct stack* s );
bool stack_is_empty( struct stack const* s );
void stack_foreach( struct stack* s, void (f)(int) );

#endif /* _STACK_H_ */

There are two types defined here: list and stack. The former is used only internally by the stack, so we declare it as an incomplete type. Only pointers to instances of such a type are allowed until its definition is given. For everyone who includes stack.h, struct list will remain incomplete. The implementation file stack.c, however, will define the structure, completing the type and allowing access to its fields (but only inside stack.c). Then struct stack is defined and the functions that work with it are declared (stack_push, stack_pop, etc.) (see Listing 13-7).

Listing 13-7. stack.c

#include <stdlib.h>
#include "stack.h"

struct list {
    int value;
    struct list* next;
};

static struct list* list_new( int item, struct list* next ) {
    struct list* lst = malloc( sizeof( *lst ) );
    lst->value = item;
    lst->next  = next;
    return lst;
}


void stack_push( struct stack* s, int value ) {
    s->first = list_new( value, s->first );
    if ( s->last == NULL ) s->last = s->first;
    s->count++;
}

int stack_pop( struct stack* s ) {
    struct list* const head = s->first;
    int value;
    if ( head ) {
        s->first = head->next;
        value = head->value;
        free( head );
        if ( 0 == --s->count ) { s->first = s->last = NULL; }
        return value;
    }
    return 0;
}

void stack_foreach( struct stack* s, void (f)(int) ) {
    struct list* cur;
    for ( cur = s->first; cur; cur = cur->next )
        f( cur->value );
}

bool stack_is_empty( struct stack const* s ) {
    return s->count == 0;
}

struct stack stack_init( void ) {
    struct stack empty = { NULL, NULL, 0 };
    return empty;
}

void stack_deinit( struct stack* st ) {
    while ( !stack_is_empty( st ) )
        stack_pop( st );
    st->first = NULL;
    st->last  = NULL;
}

This file defines all the functions declared in the header. It could be split into multiple .c files, which is sometimes good for project structure; what matters is that the compiler accepts them all and the compiled code reaches the linker. A static function list_new is defined to isolate the initialization of struct list instances. It is not exposed to the outer world. During optimizations, the compiler can not only inline it but also remove the function itself, effectively eliminating any potential performance cost. Marking the function static is necessary (but not sufficient) for this optimization to take place. Additionally, static function definitions can be placed closer to their respective callers, improving locality. By splitting your program into modules with well-described interfaces, you decrease its overall complexity and achieve better reusability.
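To see how a client of this module works, here is a small usage sketch. It condenses the essential parts of the module into one translation unit purely for illustration; in the real layout these definitions stay in stack.c behind the stack.h interface.

```c
#include <stdlib.h>
#include <stdbool.h>
#include <stddef.h>

/* Condensed single-file rendition of the stack module,
 * for illustration only. */
struct list  { int value; struct list* next; };
struct stack { struct list* first; struct list* last; size_t count; };

static struct list* list_new( int item, struct list* next ) {
    struct list* lst = malloc( sizeof( *lst ) );
    lst->value = item;
    lst->next  = next;
    return lst;
}

struct stack stack_init( void ) {
    struct stack empty = { NULL, NULL, 0 };
    return empty;
}

void stack_push( struct stack* s, int value ) {
    s->first = list_new( value, s->first );
    if ( s->last == NULL ) s->last = s->first;
    s->count++;
}

int stack_pop( struct stack* s ) {
    struct list* const head = s->first;
    if ( head ) {
        int value = head->value;
        s->first = head->next;
        free( head );
        if ( 0 == --s->count ) s->first = s->last = NULL;
        return value;
    }
    return 0; /* popping an empty stack yields 0 */
}

bool stack_is_empty( struct stack const* s ) { return s->count == 0; }
```

A client that pushes 1, 2, 3 and then pops three times receives 3, 2, 1: last in, first out.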


The need to create header files makes modifications a bit more cumbersome, because keeping the headers consistent with the code itself is up to the programmer. However, we can also benefit from it: a header provides a clear description of the interface, free of implementation details.

13.5 Immutability

It is quite common to face a choice between creating a new, modified copy of a structure and performing modifications in place. Here are some pros and cons of both options.

• Creating a copy:
–– Easier to write: you will never accidentally pass the wrong instance to a function.
–– Easier to debug, because you do not have to track changes of variables.
–– Can be optimized by the compiler.
–– Parallelization-friendly.
–– Might be slower.

• Mutating the existing instance:
–– Faster.
–– Can become very hard to debug, especially in a multithreaded environment.
–– Sometimes simpler, because you do not have to carefully and recursively copy structures with multiple pointers to other structures (a process called deep copying).
–– For objects with a distinct identity, this approach can be more intuitive and robust enough.

Our perception of the real world is based on mutable objects, because real-world objects often have a distinct identity. When you turn on your phone, the phone is not replaced by a copy of itself; rather, the state of the phone itself changes. In other words, the phone's identity is preserved while its state changes. So, in situations where you have only one instance of a given type and consecutive changes are applied to it, it is fine to modify it in place instead of making a copy on every change.
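The two options can be contrasted on a tiny hypothetical type (the vec2 name and functions are illustrative, not from the text):

```c
struct vec2 { int x, y; };

/* Immutable style: return a new, modified copy;
 * the argument is passed by value and stays untouched. */
struct vec2 vec2_scaled( struct vec2 v, int k ) {
    struct vec2 result = { v.x * k, v.y * k };
    return result;
}

/* Mutating style: change the instance in place through a pointer. */
void vec2_scale( struct vec2* v, int k ) {
    v->x *= k;
    v->y *= k;
}
```

The first form leaves the original value intact and is easy to reason about; the second avoids the copy and suits objects with a distinct identity.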

13.6 Assertions

There is a mechanism that allows us to test certain conditions during program execution. When such a condition is not satisfied, an error is signaled and the program terminates abnormally. To use the assert mechanism, include <assert.h> and use the assert macro. Listing 13-8 shows an example.

Listing 13-8. assert.c

#include <assert.h>

int main() {
    int x = 0;
    assert( x != 0 );
    return 0;
}


The condition given to the assert macro is obviously false; therefore, the program aborts and reports the assertion failure:

assert: assert.c:6: main: Assertion `x != 0' failed.

If the NDEBUG preprocessor symbol is defined (which can be achieved with the -D NDEBUG compiler option or a #define NDEBUG directive), the assert macro expands to nothing and is thus disabled. Assertions then incur zero overhead, and the checks are not performed. You should use assertions to check for "impossible" conditions that signal an inconsistency in the program state. Never use assertions to validate user input.
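A typical legitimate use is guarding an internal invariant. In this hypothetical sketch, the caller contract of ring_get guarantees a valid index; violating it is a programming error, so an assertion (not user-facing error handling) is the right tool:

```c
#include <assert.h>

/* Hypothetical fixed-size buffer; names are illustrative. */
static int ring[8];

void ring_set( int i, int v ) {
    assert( i >= 0 && i < 8 ); /* internal consistency check */
    ring[i] = v;
}

int ring_get( int i ) {
    assert( i >= 0 && i < 8 ); /* impossible unless the caller has a bug */
    return ring[i];
}
```

With NDEBUG defined, both checks vanish and only the array accesses remain.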

13.7 Error Handling

While high-level languages usually have some kind of error handling mechanism (one that does not interfere with describing the core logic), C lacks one. There are three main ways of handling errors:

1. Use return codes. A function returns not its result as such but a code showing whether it completed correctly; in the latter case, the code reflects the exact kind of error that occurred. The result of the computation is written through a pointer accepted as an additional argument. Listing 13-9 shows an example.

Listing 13-9. error_code.c

enum div_res { DIV_OK, DIV_BYZERO };

enum div_res div( int x, int y, int* result ) {
    if ( y != 0 ) { *result = x / y; return DIV_OK; }
    else return DIV_BYZERO;
}

Symmetrically, you can return values the usual way and set the error code through a pointer to a respective variable. Error codes can be described by an enumeration or by multiple #defines. You can use them as indices into a static array of messages or in a switch statement. Listing 13-10 shows an example.

Listing 13-10. err_switch_arr.c

enum err_code { ERROR1, ERROR2 };
...
enum err_code err;
...
switch ( err ) {
case ERROR1: ... break;
case ERROR2: ... break;


default: ... break;
}

/* alternatively */
static const char* const messages[] = {
    "The first error\n",
    "The second error\n"
};
fprintf( stderr, messages[err] );

Never use global variables as containers for error codes (or to return a value from a function). The C standard does define a special entity named errno. It is a modifiable lvalue, and it should not be declared explicitly. Its usage resembles a global variable, although its value is thread-local. Library functions use it as an error code container, so after seeing a function fail (e.g., fopen returned NULL) you should check the value of errno for an error code. The man pages for the respective function list the possible errno values (for example, EEXIST). Although this feature made it into the standard library, it is widely considered an anti-pattern and should not be imitated.

2. Use callbacks. Callbacks are function pointers that are passed as arguments and called by the accepting function. They can be used to isolate error handling code, but they often look strange to people more used to traditional return codes. Additionally, the execution order becomes less obvious. Listing 13-11 shows an example.

Listing 13-11. div_cb.c

#include <stdio.h>

int div( int x, int y, void (onerror)(int, int) ) {
    if ( y != 0 ) return x / y;
    else { onerror( x, y ); return 0; }
}

static void div_by_zero( int x, int y ) {
    fprintf( stderr, "Division by zero: %d / %d\n", x, y );
}

int main(void) {
    printf( "%d %d\n",
            div( 10, 2, div_by_zero ),
            div( 10, 0, div_by_zero ) );
    return 0;
}
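The return-code approach of Listing 13-9 composes naturally at the call site: the caller dispatches on the code and looks up a message by its enum value. This is a sketch; the function is renamed int_div here only to avoid clashing with div from the standard library:

```c
#include <stdio.h>

enum div_res { DIV_OK, DIV_BYZERO };

/* Same logic as Listing 13-9, under a non-clashing name. */
enum div_res int_div( int x, int y, int* result ) {
    if ( y != 0 ) { *result = x / y; return DIV_OK; }
    return DIV_BYZERO;
}

/* Messages indexed by the enum value, as suggested in the text. */
static const char* const div_messages[] = {
    "ok\n",
    "division by zero\n"
};

void report( enum div_res code ) {
    fputs( div_messages[code], stderr );
}
```

The caller checks the code before touching the result, so a failed computation can never be mistaken for a valid one.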


3. Use longjmp. This advanced technique is explained in Section 14.3.

There is also a classic error-recovery technique that relies on goto. Listing 13-12 shows an example.

Listing 13-12. goto_error_recover.c

void foo(void) {
    if ( !doA() ) goto exit;
    if ( !doB() ) goto revertA;
    if ( !doC() ) goto revertB;

    /* doA, doB, and doC succeeded */
    return;

revertB:
    undoB();
revertA:
    undoA();
exit:
    return;
}

In this example, three actions are performed, any of which can fail. The nature of these actions is such that we have to clean up after them. For example, doA might perform a dynamic memory allocation. If doA succeeds but doB fails, we have to free that memory to prevent a leak; that is what the code labeled revertA does. The rollbacks are performed in reverse order: if doA and doB succeed but doC fails, we have to revert B first and then A. So we mark the rollback stages with labels and let control fall through them: goto revertB first undoes doB and then falls through to undo doA. This pattern is commonly seen in the Linux kernel. Be careful though: gotos often make code much harder to verify, which is why they are sometimes banned.
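A concrete, runnable sketch of this pattern: two allocations that must be rolled back in reverse order on failure. The setup name and the two-allocation scenario are illustrative, not from the listing above.

```c
#include <stdlib.h>
#include <stdbool.h>

/* Acquire two resources; on partial failure, release what was
 * acquired, in reverse order. Returns true on full success. */
bool setup( char** a, char** b ) {
    *a = malloc( 16 );
    if ( !*a ) goto exit;       /* nothing to undo yet */
    *b = malloc( 16 );
    if ( !*b ) goto revert_a;   /* undo the first allocation */
    return true;

revert_a:
    free( *a );
    *a = NULL;
exit:
    return false;
}
```

The caller owns both resources only when setup returns true; otherwise the function guarantees nothing was leaked.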

13.8 On Memory Allocation

• Many programmers discourage variable-length arrays allocated on the stack. They are an easy way to get a stack overflow if you do not check the length well enough. What is even worse, there is no way to know whether you can safely allocate that many bytes on the stack or not.

• Don't abuse malloc! As you will see in the last assignment of this chapter, malloc is not cheap. Whenever you want to allocate something reasonably small, do it on the stack, as a local variable. If some function needs the address of a struct, it can take the address of a stack-allocated struct and pass it along. This prevents memory leaks and improves code performance and readability.

• Global variables pose no threat as long as they are constant. The same holds for static local variables. Use them if you want to limit the usage of a particular constant to one function.


13.9 On Flexibility

We do advocate code reuse. Taken to the extreme, however, it results in an absurd number of abstraction layers and boilerplate code that exists only to support a potential future need for additional features (which might never materialize). There is no silver bullet, in a broad sense: every programming style and every computational model is nice and concise in some cases and bulky and verbose in others. Likewise, the best tool is a specialized one rather than a jack-of-all-trades. You could turn an image viewer into a powerful editor that also plays video and edits ID3 tags, but the image-viewing facet will surely suffer, as will the user experience.

Writing more abstract code can pay off, because such code is easier to adapt to new contexts. At the same time, it introduces complexity that may be unnecessary. Generalize only up to the point where it does not hurt. To know when to stop, you need to answer several questions, such as:

• What is the purpose of your program or library?
• What are the limits of the functionality you envision for your program?
• Will this function be easier to write, use, and/or debug if it is written in a more general way?

While the first two questions are highly subjective, the last one can be illustrated with an example. Let's take a look at the code shown in Listing 13-13.

Listing 13-13. dump_1.c

void dump( char const* filename ) {
    FILE* f = fopen( filename, "w" );
    fprintf( f, "this is dump %d", 42 );
    fclose( f );
}

Compare it with another version of the same logic, split into two functions, shown in Listing 13-14.

Listing 13-14. dump_2.c

void dump( FILE* f ) {
    fprintf( f, "this is dump %d", 42 );
}

void fun( void ) {
    FILE* f = fopen( "dump.txt", "w" );
    dump( f );
    fclose( f );
}

The second version is preferable for two reasons:

• The first version requires a file name, which means you cannot use it to write to stderr or stdout.
• The second version separates the file-opening logic from the file-writing logic.
If you want to handle errors that might occur in calls to fprintf, fopen, or fclose, you will handle the fopen errors separately, keeping both functions relatively simple. The dump function does not have to handle file-opening errors at all: it is simply not called if opening fails.


Listing 13-15 shows the same logic with error handling added. As you can see, there is no handling of file opening and closing errors inside dump; it is done in fun instead.

Listing 13-15. file_open_sep.c

#include <stdio.h>

enum stat { STAT_OK, STAT_ERR_OPEN, STAT_ERR_CLOSE, STAT_ERR_WRITE };

enum stat dump( FILE* f ) {
    if ( fprintf( f, "this is dump %d", 42 ) > 0 ) return STAT_OK;
    return STAT_ERR_WRITE;
}

enum stat fun( void ) {
    enum stat dump_stat;
    FILE* f;
    f = fopen( "dump.txt", "w" );
    if ( !f ) return STAT_ERR_OPEN;
    dump_stat = dump( f );
    if ( dump_stat != STAT_OK ) return dump_stat;
    if ( fclose( f ) ) return STAT_ERR_CLOSE;
    return STAT_OK;
}

Had dump performed multiple writes, each checked in the same way, the function would become cluttered and thus less readable.
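The separation makes the logic easy to exercise: here is a variant of the same code where fun writes to a temporary stream obtained from tmpfile(), so it can run anywhere without leaving a dump.txt behind. The tmpfile substitution is an adaptation for illustration, not part of the listing above.

```c
#include <stdio.h>

enum stat { STAT_OK, STAT_ERR_OPEN, STAT_ERR_CLOSE, STAT_ERR_WRITE };

/* Identical to dump from Listing 13-15: it only writes. */
enum stat dump( FILE* f ) {
    if ( fprintf( f, "this is dump %d", 42 ) > 0 ) return STAT_OK;
    return STAT_ERR_WRITE;
}

/* Like fun from Listing 13-15, but using a temporary stream. */
enum stat fun( void ) {
    enum stat dump_stat;
    FILE* f = tmpfile();
    if ( !f ) return STAT_ERR_OPEN;
    dump_stat = dump( f );
    if ( dump_stat != STAT_OK ) return dump_stat;
    if ( fclose( f ) ) return STAT_ERR_CLOSE;
    return STAT_OK;
}
```

Because dump takes a FILE*, swapping the destination required no change to the writing logic at all.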

13.10 Task: Image Rotation

Your task is to create a program that rotates a BMP image of any resolution by 90 degrees clockwise.

13.10.1 BMP File Format

The BMP (BitMaP) format is a raster graphics format, meaning it stores an image as a table of colored dots (pixels). In this format, color is encoded with fixed-length numbers (1, 4, 8, 16, or 24 bits per pixel). With 1 bit per pixel the image is black and white; with 24 bits the number of possible colors is about 16 million. We will implement rotation for 24-bit images only. The subset of BMP files your program must be able to work with is described by the structure shown in Listing 13-16. It represents the file header, which is immediately followed by the pixel data.


Listing 13-16. bmp_struct.c

#include <stdint.h>

struct __attribute__((packed)) bmp_header {
    uint16_t bfType;
    uint32_t bfileSize;
    uint32_t bfReserved;
    uint32_t bOffBits;
    uint32_t biSize;
    uint32_t biWidth;
    uint32_t biHeight;
    uint16_t biPlanes;
    uint16_t biBitCount;
    uint32_t biCompression;
    uint32_t biSizeImage;
    uint32_t biXPelsPerMeter;
    uint32_t biYPelsPerMeter;
    uint32_t biClrUsed;
    uint32_t biClrImportant;
};

■■Question 259  Read the BMP file specification and identify what these fields are responsible for.

The file format depends on the bit count per pixel. For 16 or 24 bits per pixel, there is no color palette. Each pixel is encoded by 24 bits, or 3 bytes, as shown in Listing 13-17. Each component is a number from 0 to 255 (one byte) showing the amount of blue, green, or red in this pixel. The resulting color is an overlay of these three basic colors.

Listing 13-17. pixel.c

struct pixel { unsigned char b, g, r; };

Each row of pixels is padded so that its length in bytes is a multiple of 4. For example, suppose the image width is 15 pixels, which corresponds to 15 × 3 = 45 bytes. To pad the row, we skip 3 bytes (up to 48, the nearest multiple of 4) before starting the next row of pixels. Because of this, the actual size of the pixel data differs from the product of the width, height, and pixel size (3 bytes).
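The row-size arithmetic described above can be captured in two small helpers. These are illustrative functions, not part of the assignment interface:

```c
#include <stdint.h>

/* Bytes occupied by one row of `width` 24-bit pixels,
 * rounded up to a multiple of 4 as the format requires. */
uint32_t bmp_row_size( uint32_t width ) {
    uint32_t raw = width * 3;    /* 3 bytes per pixel */
    return ( raw + 3 ) / 4 * 4;  /* round up to the nearest multiple of 4 */
}

/* Padding bytes appended after the raw pixel bytes of one row. */
uint32_t bmp_row_padding( uint32_t width ) {
    return bmp_row_size( width ) - width * 3;
}
```

For the 15-pixel example from the text, the raw row is 45 bytes, the stored row is 48 bytes, and the padding is 3 bytes.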

■■Note Remember to open the image in binary mode!


13.10.2 Architecture

We want to design a program architecture that is extensible and modular.

1. Describe the pixel structure struct pixel so that you never work with the raster table directly, as if it were completely unstructured data. Working with raw, unstructured bytes should always be avoided.

2. Separate the internal image representation from the input format. The rotation is performed on the internal image format, which is then serialized back to BMP. The BMP format may change, you may want to support other formats, and you do not want to tightly couple the rotation algorithm to BMP. To achieve this, define a struct image that stores the array of pixels (contiguous, this time without padding) and whatever information really has to be kept. For example, there is absolutely no need to store the BMP signature here, or any of the never-used header fields; the image width and height in pixels, however, we cannot do without. You will need functions to read an image from a BMP file and to write it into a BMP file (and probably one to generate a BMP header from the internal representation).

3. Separate opening the file from reading it.

4. Make error handling uniform and handle errors in exactly one place (for this program that is enough). To do so, define a function from_bmp, which reads an image from a stream and returns one of the codes showing whether the operation completed successfully. Remember the flexibility concerns: your code should be easy to use both in applications with a graphical user interface (GUI) and in those without one, so scattering stderr prints everywhere is not a good option; limit them to the error-handling piece of code. Your code should also be easily adaptable to different input formats. Listing 13-18 shows some starting code.

Listing 13-18.
image_rot_stub.c

#include <stdint.h>
#include <stdio.h>

struct pixel { uint8_t b, g, r; };

struct image {
    uint64_t width, height;
    struct pixel* data;
};

/* deserializer */
enum read_status {
    READ_OK = 0,
    READ_INVALID_SIGNATURE,
    READ_INVALID_BITS,
    READ_INVALID_HEADER
    /* more codes */
};

enum read_status from_bmp( FILE* in, struct image* const read );


/* image_t from_jpg( FILE* ); ...
 * Other deserializers are possible as well.
 * All the necessary information is
 * stored in struct image. */

/* make a rotated copy */
struct image rotate( struct image const source );

/* serializer */
enum write_status {
    WRITE_OK = 0,
    WRITE_ERROR
    /* more codes */
};

enum write_status to_bmp( FILE* out, struct image const* img );
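To make the index arithmetic concrete, here is one possible sketch of rotate under the stub interface above. It is only a sketch under the stated assumptions: row-major pixel storage, index y * width + x, and a 90-degree clockwise turn mapping source pixel (x, y) to destination pixel (height - 1 - y, x). Your solution may organize this differently.

```c
#include <stdint.h>
#include <stdlib.h>

struct pixel { uint8_t b, g, r; };
struct image { uint64_t width, height; struct pixel* data; };

/* Make a 90-degrees-clockwise rotated copy of `source`.
 * The caller owns (and must free) the returned pixel array. */
struct image rotate( struct image const source ) {
    struct image result;
    uint64_t x, y;
    result.width  = source.height;
    result.height = source.width;
    result.data   = malloc( sizeof( struct pixel )
                            * result.width * result.height );
    for ( y = 0; y < source.height; y++ )
        for ( x = 0; x < source.width; x++ )
            /* source (x, y) lands at column height-1-y, row x */
            result.data[ x * result.width + ( source.height - 1 - y ) ] =
                source.data[ y * source.width + x ];
    return result;
}
```

A 1 × 2 column of pixels becomes a 2 × 1 row with the bottom pixel first, exactly as a physical clockwise turn would place it.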

■■Question 260  Implement blurring. It is done quite simply: for each pixel, compute its new components as the average over a 3 × 3 pixel window (called the kernel). The edge pixels are left intact.

■■Question 261  Implement rotation by an arbitrary angle (not just 90 or 180 degrees).

■■Question 262  Implement the "dilation" and "erosion" transformations. They are similar to blurring, but instead of averaging over a window you must compute the minimum (erosion) or maximum (dilation) values of the components.

13.11 Task: Custom Memory Allocator

In this task we are going to implement our own version of malloc and free based on the mmap system call and a linked list of chunks of arbitrary sizes. It can be viewed as a simplified version of a typical standard C library memory manager, and it shares most of its weaknesses. Using malloc/calloc, free, and realloc is prohibited in this task. As we know, these functions are used to manipulate the heap. The heap consists of anonymous pages and is essentially a linked list of chunks. Each chunk consists of a header and the data itself. The header is described by the structure shown in Listing 13-19.

Listing 13-19. mem_str.c

struct mem {
    struct mem* next;
    size_t capacity;
    bool is_free;
};


The header is immediately followed by the usable region. We need to store both the size and the link to the next chunk, because in our case the heap can have holes in it, for two reasons:

• The start of the heap can be placed between two already allocated regions.
• The heap can grow to an arbitrary size.

A heap allocation splits the first fitting free chunk in two (provided its size allows it): the first part is marked as not free and its address is returned. If no free chunk is large enough for the requested size, the allocator tries to get more memory from the operating system by calling mmap.

It makes no sense to allocate chunks of 1 or 3 bytes; they are too small, and usually a waste since the header is bigger anyway. So we introduce a constant BLOCK_MIN_SIZE for the minimal allowed chunk size (not including the header). Given a request for query bytes, we first raise it to BLOCK_MIN_SIZE if it is too small. We then iterate over the chain of chunks and apply the following logic to each chunk: if the chunk is free and its capacity is sufficient for query bytes, split it and return the data address; otherwise, move on to the next chunk.

Debugging helpers are also useful: a function memalloc_debug_struct_info( FILE* f, struct mem const* address ) that prints the chunk header fields (its capacity and is_free flag) together with the first bytes of its data, and a function that walks the whole heap:

void memalloc_debug_heap( FILE* f, struct mem const* ptr ) {
    for ( ; ptr; ptr = ptr->next )
        memalloc_debug_struct_info( f, ptr );
}

An estimated number of lines of code is 150-200. Don't forget to write a Makefile.
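The chunk-splitting step can be sketched as follows. This is a simplified illustration, not a complete solution: the BLOCK_MIN_SIZE value, the 8-byte alignment rounding, the block_split and arena_init names, and the static arena standing in for mmap-ed memory are all assumptions made for the sketch.

```c
#include <stddef.h>
#include <stdbool.h>

struct mem { struct mem* next; size_t capacity; bool is_free; };

#define BLOCK_MIN_SIZE 16 /* illustrative value */

/* Split a free chunk in two if it can hold `query` bytes plus a
 * header and a minimal chunk for the remainder. On success the
 * first part is marked used; the rest becomes a new free chunk. */
bool block_split( struct mem* block, size_t query ) {
    struct mem* rest;
    if ( query < BLOCK_MIN_SIZE ) query = BLOCK_MIN_SIZE;
    query = ( query + 7 ) / 8 * 8; /* keep the next header aligned */
    if ( !block->is_free
      || block->capacity < query + sizeof( struct mem ) + BLOCK_MIN_SIZE )
        return false;
    rest = (struct mem*)( (char*)( block + 1 ) + query );
    rest->next     = block->next;
    rest->capacity = block->capacity - query - sizeof( struct mem );
    rest->is_free  = true;
    block->next     = rest;
    block->capacity = query;
    block->is_free  = false;
    return true;
}

/* A tiny static arena standing in for mmap-ed memory. */
static union { struct mem head; char bytes[1024]; } arena;

struct mem* arena_init( void ) {
    arena.head.next     = NULL;
    arena.head.capacity = sizeof( arena ) - sizeof( struct mem );
    arena.head.is_free  = true;
    return &arena.head;
}
```

After one successful split, the arena holds a used chunk of the requested (rounded) size followed by a smaller free chunk; a real allocator would keep walking the next links when the split fails.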


13.12 Summary

In this chapter we have discussed some of the most important recommendations about coding style and program architecture. We have seen naming conventions and the reasoning behind common code guidelines. When writing code, we must adhere to certain constraints derived from the requirements on our code as well as from the development process itself. We have also covered concepts as important as encapsulation. Finally, we provided two larger tasks where you can apply your new knowledge of program architecture. In the next part, we will delve into the details of translation, review some language features that are easier to understand on the assembly level, and talk about compiler optimizations and performance.


PART III

Between C and Assembly

CHAPTER 14

Translation Details In this chapter, we will revise the notion of calling convention to deepen our understanding of it and work through the translation details. This requires an understanding of how the program works on the assembly level and some familiarity with C. We will also review some classic low-level security vulnerabilities that a careless programmer can introduce. Understanding these low-level translation details is sometimes crucial for rooting out very subtle bugs that do not reveal themselves on every execution.

14.1 Function Call Sequence In Chapter 2, we studied how procedures are called, how they return values, and how they accept arguments. The complete call sequence is described in [24] and we strongly recommend that you take a look at it. Let's review this process and add valuable details.

14.1.1 XMM Registers In addition to the registers we have already seen, modern processors have several sets of special registers coming from processor extensions. An extension adds circuitry, extends the instruction set, and sometimes adds usable registers. One notable extension is called SSE (Streaming SIMD Extensions); it brings a set of xmm registers: xmm0, xmm1, ..., xmm15. They are 128 bits wide and are typically used for two kinds of tasks: • Floating-point arithmetic; and • SIMD instructions (instructions that perform an action on multiple data items at once). The usual mov instruction cannot work with xmm registers. Instead, the movq instruction is used to copy data between the least significant half of an xmm register (64 bits of 128) on one side and xmm registers, general-purpose registers, or memory on the other side (also 64 bits). To fill an entire xmm register, you have two options: movdqa and movdqu. The first reads as "move aligned double quadword"; the second is its unaligned counterpart. Most SSE instructions require memory operands to be aligned. Unaligned versions of these instructions often exist under different mnemonics and carry a performance penalty for misaligned reads. Because SSE instructions are often used in performance-critical places, it is usually wiser to stick to the instructions that require operand alignment. We will use SSE instructions to perform high-performance computations in Section 16.4.1.

■■Question 263  Read about the movq, movdqa and movdqu instructions in [15].

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_14


Chapter 14 ■ Translation Details

14.1.2 Calling Convention A calling convention is a set of rules about the function call sequence to which programmers voluntarily adhere. If everyone follows the same rules, smooth interoperability is guaranteed. However, once someone breaks them, for example, changes rbp inside some function and does not restore it, anything can happen: nothing, a delayed crash, or an immediate crash. The reason is that other functions are written under the assumption that these rules are respected and rely on rbp being untouched. The calling convention dictates, among other things, the argument-passing algorithm. For the typical *nix x86 64 convention we are using (described in full in [24]), the following is a fairly accurate approximation of how a function is called. 1. First, the registers whose values must survive the call are saved. All registers except the seven callee-saved ones (rbx, rbp, rsp, and r12-r15) can be changed by the called function, so if their values matter, they should be stored (probably on the stack). 2. The registers and the stack are filled with arguments. The size of each argument is rounded up to 8 bytes. The arguments are divided into three classes: (a) Integer or pointer arguments. (b) Floats and doubles. (c) Arguments passed in memory via the stack ("memory"). The first six arguments of the first class are passed in general-purpose registers (rdi, rsi, rdx, rcx, r8, and r9). The first eight arguments of the second class are passed in registers xmm0 to xmm7. If there are more arguments from these classes to pass, they are passed on the stack in reverse order: the last argument is on top of the stack before the call is performed. While integers and floats are fairly trivial to deal with, structures are a bit trickier. If a structure is bigger than 32 bytes, or has unaligned fields, it is passed in memory.
A smaller structure is decomposed into fields, and each field is treated separately (recursively, if it is itself a structure). So a structure of two elements can be passed the same way as two separate arguments. If one of the fields of a structure is classified as "memory", this classification is propagated to the whole structure. The rbp register, as we will see, is used to address the arguments passed in memory as well as local variables. What about return values? Integer and pointer values are returned in rax and rdx. Floating-point values are returned in xmm0 and xmm1. Big structures are returned via a pointer, provided as an additional hidden argument, in the spirit of the following example:


struct s { char vals[100]; };

/* what we write */
struct s f( int x ) {
    struct s mine;
    mine.vals[10] = 42;
    return mine;
}

/* what actually happens */
void f( int x, struct s* ret ) {
    ret->vals[10] = 42;
}

3. Then the call instruction is executed. Its operand is the address of the first instruction of the called function. It pushes the return address onto the stack. A program can have multiple instances of the same function alive at once, not only in different threads but also because of recursion. Each such function instance is stored on the stack, because the stack's main principle, "last in, first out," corresponds to the way functions start and end. If a function f starts and then calls a function g, g ends first (although it was called last) and f ends last (although it was called first). A stack frame is the part of the stack dedicated to a single function instance. It stores the values of local variables, temporary variables, and saved registers. Function code is usually bracketed by a pair of a prologue and an epilogue, which are similar for all functions. The prologue helps initialize the stack frame, and the epilogue deinitializes it. During function execution, rbp stays unchanged and points at the beginning of its stack frame. It is possible to address local variables and stack arguments relative to rbp. This is reflected in the function prologue shown in Listing 14-1.

Listing 14-1. prologue.asm

func:
    push rbp
    mov rbp, rsp
    sub rsp, 24    ; given 24 is the total size of the local variables

The old rbp value is saved to be restored later in the epilogue. Then a new rbp is set to the current top of the stack (which, by the way, now holds the old rbp value). Then the memory for local variables is allocated on the stack by subtracting their total size from rsp. This is how automatic memory allocation in C works, and it is the technique we used in the first assignment to allocate buffers on the stack. Functions end with an epilogue, shown in Listing 14-2.

Listing 14-2. epilogue.asm

    mov rsp, rbp
    pop rbp
    ret


By moving the start address of the stack frame into rsp, we ensure that all memory allocated on the stack for local variables is deallocated. Then the old rbp value is restored, so rbp points at the beginning of the previous stack frame again. Finally, ret pops the return address off the stack and jumps to it. Sometimes the compiler chooses a completely equivalent alternative form, shown in Listing 14-3.

Listing 14-3. epilogue_alt.asm

    leave
    ret

The leave instruction is designed specifically to destroy stack frames. Compilers rarely use its counterpart, enter, because it is more functional than the instruction sequence shown in Listing 14-1: it is intended for languages with nested function support. 4. After the function returns, our work is not always done. If some arguments were passed in memory (on the stack), we also have to get rid of them.

14.1.3 Example: A Simple Function and Its Stack Let's take a look at a simple function that computes the maximum of two values. We compile it without optimizations and study the assembly listing. Listing 14-4 shows the source.

Listing 14-4. max.c

int max( int a, int b ) {
    char buffer[4096];
    if ( a < b ) return b;
    return a;
}

int main( void ) {
    int x = max( 42, 999 );
    return 0;
}

Listing 14-5 shows the disassembly produced by objdump.

Listing 14-5. max.asm

00000000004004b6 <max>:
  4004b6: 55                      push   rbp
  4004b7: 48 89 e5                mov    rbp,rsp
  4004ba: 48 81 ec 90 0f 00 00    sub    rsp,0xf90
  4004c1: 89 bd fc ef ff ff       mov    DWORD PTR [rbp-0x1004],edi
  4004c7: 89 b5 f8 ef ff ff       mov    DWORD PTR [rbp-0x1008],esi
  4004cd: 8b 85 fc ef ff ff       mov    eax,DWORD PTR [rbp-0x1004]
  4004d3: 3b 85 f8 ef ff ff       cmp    eax,DWORD PTR [rbp-0x1008]
  4004d9: 7d 08                   jge    4004e3 <max+0x2d>
  4004db: 8b 85 f8 ef ff ff       mov    eax,DWORD PTR [rbp-0x1008]
  4004e1: eb 06                   jmp    4004e9 <max+0x33>
  4004e3: 8b 85 fc ef ff ff       mov    eax,DWORD PTR [rbp-0x1004]
  4004e9: c9                      leave
  4004ea: c3                      ret

00000000004004eb <main>:
  ...
  4004f3: be e7 03 00 00          mov    esi,0x3e7
  4004f8: bf 2a 00 00 00          mov    edi,0x2a
  4004fd: e8 b4 ff ff ff          call   4004b6 <max>
  ...

After a bit of cleanup, we get the pure, more readable assembly code shown in Listing 14-6.

Listing 14-6. max_refined.asm

mov rsi, 999
mov rdi, 42
call max
...
max:
    push rbp
    mov rbp, rsp
    sub rsp, 3984
    mov [rbp-0x1004], edi
    mov [rbp-0x1008], esi
    mov eax, [rbp-0x1004]
    ...
    leave
    ret

■■Note  See Section 3.4.2 for an explanation of why changing esi implies changing the whole rsi.

Let's trace the function call and its prologue (see Listing 14-6) and draw the contents of the stack immediately after each instruction executes.


(Figures: the stack contents after "call max", after "push rbp", after "mov rbp, rsp", and after "sub rsp, 3984".)

14.1.4 Red Zone The red zone is an area of 128 bytes spanning from rsp toward the lower addresses. It relaxes the "no data below rsp" rule: it is safe to allocate data there, and it will not be overwritten by system calls or interrupts. We are talking about direct memory writes relative to rsp without changing rsp itself. Function calls, however, do overwrite the red zone. The red zone exists to allow a specific optimization: if a function never calls other functions, it can skip creating a stack frame (changing rbp). Local variables and arguments are then addressed relative to rsp, not rbp. For this to work: • The total size of the local variables is less than 128 bytes. • The function is a leaf function (it calls no other functions). • The function does not change rsp; otherwise addressing memory relative to it is impossible. By moving rsp further, you can still get more than 128 bytes of space for your data on the stack. See also Section 16.1.3.

14.1.5 Variable Number of Arguments  The calling convention we are using supports functions with a variable number of arguments. This means that a function can accept an arbitrary number of arguments. It is possible because passing arguments (and clearing the stack after the function completes) is the responsibility of the calling function. Declarations of such functions contain a so-called ellipsis: three dots in place of the last argument. The typical function with a variable number of arguments is our old friend printf:

int printf( const char* format, ... );


How does printf know the exact number of arguments? It knows for sure that at least one argument (const char* format) was passed. By parsing this string and counting the format specifiers inside it, it calculates the total number of arguments as well as their types (and hence the registers they should live in).
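The counting idea can be sketched as follows (a toy version of ours; a real printf must of course also parse widths, length modifiers, and conversion types):

```c
#include <stddef.h>

/* Count the conversion specifiers in a format string the way a
   printf-like function must, skipping the literal "%%" escape. */
size_t count_specifiers(const char *fmt) {
    size_t n = 0;
    for (; *fmt; fmt++) {
        if (*fmt == '%') {
            if (fmt[1] == '%')
                fmt++;   /* "%%" prints '%' and consumes no argument */
            else
                n++;     /* a real specifier: one more argument expected */
        }
    }
    return n;
}
```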

■■Note  In the case of a variable number of arguments, al must contain the number of xmm registers used by the arguments.

As you can see, there is absolutely no way to tell exactly how many arguments were passed. The function deduces it from the arguments that are actually present (format in this case). If there are more format specifiers than arguments, printf will not know it and will naively try to fetch the contents of the respective registers and memory. Apparently, this functionality cannot be coded in C directly by a programmer, since the registers cannot be accessed directly. However, there is a portable mechanism for declaring functions with a variable argument count, which is part of the standard library. Each platform has its own implementation of this mechanism. It becomes available after including the stdarg.h file and consists of the following:
• va_list, a structure that stores information about the arguments.
• va_start, a macro that initializes a va_list.
• va_end, a macro that deinitializes a va_list.
• va_arg, a macro that, given an instance of va_list and a type name, extracts the next argument from the argument list.
Listing 14-7 shows an example. The printer function accepts the number of arguments and then an arbitrary number of them.

Listing 14-7. vararg.c
#include <stdarg.h>
#include <stdio.h>

void printer( unsigned long argcount, ... ) {
    va_list args;
    unsigned long i;
    va_start( args, argcount );
    for ( i = 0; i < argcount; i++ )
        printf( "%d\n", va_arg( args, int ) );
    va_end( args );
}

int main(void) {
    printer( 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 );
    return 0;
}

First, va_list is initialized by va_start with the name of the last argument before the dots. Then each call to va_arg fetches the next argument; its second parameter is the type name of that argument. In the end, va_list is deinitialized by va_end.
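A variation of ours on the same machinery: a variadic function that returns a value computed from its arguments, built from the same va_start, va_arg, and va_end sequence:

```c
#include <stdarg.h>

/* Sums `count` int arguments. The named parameter `count` is the
   anchor that va_start uses to locate the variadic part. */
long sum_ints(unsigned long count, ...) {
    va_list args;
    long total = 0;
    unsigned long i;
    va_start(args, count);           /* initialize with the last named argument */
    for (i = 0; i < count; i++)
        total += va_arg(args, int);  /* fetch the next argument as an int */
    va_end(args);                    /* deinitialize */
    return total;
}
```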


This example may seem confusing, since a type name is passed to a macro as if it were an argument, and va_list is passed by name but gets changed.

■■Question 264  Can you imagine a situation where a function, not a macro, accepts a variable by name (syntactically) and changes it? What should the type of this variable be?

14.1.6 vprintf and Friends  Functions like printf, fprintf, etc. have special versions that accept a va_list as their last argument. Their names are prefixed with the letter v, for example,

int vprintf( const char* format, va_list ap );

They are used inside custom functions that in turn accept an arbitrary number of arguments. Listing 14-8 shows an example.

Listing 14-8. vsprintf.c
#include <stdarg.h>
#include <stdio.h>

void logmsg( int client_id, const char* const str, ... ) {
    va_list args;
    char buffer[1024];
    char* bufptr = buffer;
    va_start( args, str );
    bufptr += sprintf( bufptr, "from client %d: ", client_id );
    vsprintf( bufptr, str, args );
    fprintf( stderr, "%s", buffer );
    va_end( args );
}
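A sketch of ours in the same spirit, but using the length-checked vsnprintf instead of vsprintf (the function name and the "log: " tag are hypothetical):

```c
#include <stdarg.h>
#include <stdio.h>

/* Formats into a caller-provided buffer, prefixing a fixed tag.
   vsnprintf receives our va_list and never writes past `cap` bytes. */
int tagged_format(char *dst, size_t cap, const char *fmt, ...) {
    va_list args;
    int n;
    va_start(args, fmt);
    n = snprintf(dst, cap, "log: ");
    n += vsnprintf(dst + n, cap - n, fmt, args);
    va_end(args);
    return n;  /* characters written, excluding the terminator */
}
```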

14.2 volatile  The volatile keyword greatly affects how the compiler optimizes code. The model of computation for C is a von Neumann machine; it does not account for parallel program execution, and the compiler tries to perform as many optimizations as possible as long as they do not change the program's observable behavior. These may include reordering instructions, caching variables in registers, or even skipping a read of a value from memory when that value is never written. Reads from and writes to volatile variables, however, always happen, and their relative order is preserved as well.


The main use cases are as follows:
• Memory-mapped I/O, when communication with external devices is performed by reading and writing a certain dedicated memory region. Writing a character to video memory (resulting in its display on the screen) really has to happen.
• Data exchange between threads. If a memory location is used to communicate with other threads, you do not want reads or writes of it to be optimized away. Note that volatile by itself is not sufficient for robust inter-thread communication.
Like the const qualifier, in the case of a pointer, volatile can be applied to the pointed-to data as well as to the pointer itself. The rule is the same: volatile to the left of the asterisk refers to the data pointed to; to the right, to the pointer itself.
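The placement rule can be illustrated with this little sketch of ours:

```c
/* `volatile` to the left of '*' qualifies the pointed-to data;
   to the right of '*', the pointer variable itself. */
int volatile_placement_demo(void) {
    int data = 0;
    volatile int *to_volatile = &data;    /* the pointed-to int is volatile */
    int * volatile volatile_ptr = &data;  /* the pointer itself is volatile */
    *to_volatile = 5;                     /* this store is always performed */
    *volatile_ptr += 1;                   /* accesses of the pointer are kept */
    return data;
}
```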

14.2.1 Lazy Memory Allocation  Many operating systems allocate pages lazily, at the moment of their first use, rather than right after the mmap call (or an equivalent). If the programmer does not want delays on first page use, he can choose to touch each page in advance so that the operating system actually creates it, as shown in Listing 14-9.

Listing 14-9. lma_bad.c
char* ptr;
for( ptr = start; ptr < start + size; ptr += pagesize ) *ptr;

However, this code has no observable effects from the compiler's point of view, so it can be optimized out completely. When the pointer is marked volatile, this is no longer the case. Listing 14-10 shows an example.

Listing 14-10. lma_good.c
volatile char* ptr;
for( ptr = start; ptr < start + size; ptr += pagesize ) *ptr;
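The same idiom packaged as a self-contained function (the name and the explicit pagesize parameter are ours):

```c
#include <stddef.h>

/* Touch one byte of every page so the OS backs the whole range now.
   Because `p` is a volatile pointer, common compilers keep the reads. */
size_t prefault(volatile char *start, size_t size, size_t pagesize) {
    size_t touched = 0;
    volatile char *p;
    for (p = start; p < start + size; p += pagesize) {
        (void)*p;  /* the read itself is the whole point */
        touched++;
    }
    return touched;
}
```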

■■Volatile pointers in the standard  If a volatile pointer points to non-volatile memory, the standard gives no guarantees! The guarantees exist only when both the pointer and the memory are volatile. So, strictly speaking, the example above is incorrect. However, because programmers use volatile pointers with exactly this reasoning, the most commonly used compilers (MSVC, GCC, clang) do not optimize away dereferences of volatile pointers. There is no standard way to achieve this.

14.2.2 Generated Code Let's study the example shown in Listing 14-11.


Listing 14-11. volatile_ex.c
#include <stdio.h>

int main( int argc, char** argv ) {
    int ordinary = 0;
    volatile int vol = 4;
    ordinary++;
    vol++;
    printf( "%d\n", ordinary );
    printf( "%d\n", vol );
    return 0;
}

There are two variables: one is volatile, the other is not. Both are incremented and passed to printf as arguments. GCC generates the following code (with optimization level -O2), shown in Listing 14-12.

Listing 14-12. volatile_ex.asm
; these are the two arguments of `printf`;
; `ordinary` is not even created in the stack frame:
; its final precomputed value 1 was put into `esi` in the first line!
    mov  esi, 0x1
    mov  edi, 0x4005d4
; vol = 4
    mov  DWORD PTR [rsp+0xc], 0x4
; vol++
    mov  eax, DWORD PTR [rsp+0xc]
    add  eax, 0x1
    mov  DWORD PTR [rsp+0xc], eax
    xor  eax, eax
; printf( "%d\n", ordinary );
    call 4003e0
; the second argument is taken from memory: it is volatile!
    mov  esi, DWORD PTR [rsp+0xc]
; the first argument is the address of "%d\n"
    mov  edi, 0x4005d4
    xor  eax, eax
; printf( "%d\n", vol )
    call 4003e0
    xor  eax, eax

As we see, the contents of the volatile variable are actually read and written every time they occur in the C code. The ordinary variable is not even created: the computations are performed at compile time, and the final result is stored in esi, waiting to be used as the second argument of a call.


14.3 Nonlocal Jumps: setjmp  The standard C library contains machinery to perform a quite complicated kind of trick. It lets you store the execution context and restore it later. The context describes the program state with the following exceptions:
• Everything related to the outside world (for example, open file descriptors).
• The floating-point computation context.
• Stack variables.
This allows us to save the context and return to it whenever we feel like we should go back. We are not limited to the scope of the same function. Include setjmp.h to gain access to the following machinery:
• jmp_buf, a type of variable that can store a context.
• int setjmp(jmp_buf env), a function that accepts an instance of jmp_buf and stores the current context in it. By default, it returns 0.
• void longjmp(jmp_buf env, int val), used to return to a saved context, stored in a given variable of type jmp_buf. After the jump, setjmp returns not 0 but the value val fed to longjmp.
Listing 14-13 shows an example. The first setjmp returns the default 0, which becomes the value of val. Then longjmp is given 1 as its argument, and program execution continues from the return of setjmp (because they are linked through jb). This time setjmp returns 1, and this is the value assigned to val.

Listing 14-13. longjmp.c
#include <setjmp.h>
#include <stdio.h>

int main(void) {
    jmp_buf jb;
    int val;
    val = setjmp( jb );
    puts( "Hello!" );
    if (val == 0) longjmp( jb, 1 );
    else puts( "End" );
    return 0;
}

Local variables that are not marked volatile have undefined values after longjmp. This is a source of bugs, as are the problems related to freeing memory: it is hard to analyze the control flow in the presence of longjmp and ensure that all dynamically allocated memory gets freed. In general, calling setjmp as part of a complex expression is allowed only in rare cases; in most of them it is undefined behavior. So better not.
It is important to remember that all this machinery is based on the use of stack frames. This means you cannot perform longjmp into a function whose stack frame no longer exists. For example, the code shown in Listing 14-14 produces undefined behavior for exactly this reason.


Listing 14-14. longjmp_ub.c
jmp_buf jb;

void f(void) { setjmp( jb ); }

void g(void) {
    f();
    longjmp( jb, 1 );
}

The function f has already returned, but we longjmp into it. The program behavior is undefined because we are trying to restore a context inside a destroyed stack frame. In other words, you can only jump within the same function or into a function whose invocation has started and not yet finished.
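By contrast, here is a sketch of ours of a legal use: the frame holding the setjmp call site is still alive when longjmp fires.

```c
#include <setjmp.h>

static jmp_buf jb2;  /* a separate buffer for this sketch */

static void worker(int fail) {
    if (fail)
        longjmp(jb2, 7);  /* jumps into a caller that is still executing */
}

/* Returns 0 on the direct path, or longjmp's value on the second return. */
int run(int fail) {
    int val = setjmp(jb2);
    if (val == 0)
        worker(fail);  /* called while run's frame is still on the stack */
    return val;
}
```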

14.3.1 Volatile and setjmp  The compiler thinks that setjmp is just a function. However, this is not really the case, because it is the point from which program execution can start again. Under normal circumstances, some local variables might be cached in registers (or never even allocated) before the call to setjmp. When we come back to this point through a longjmp call, they are not restored. Disabling optimizations changes this behavior; thus, disabled optimizations can hide bugs related to the use of setjmp. To write correct code, remember that only volatile local variables have defined values after longjmp. They are not restored to their old values: they keep the values they had just before the longjmp, because jmp_buf does not save stack variables. Listing 14-15 shows an example.

Listing 14-15. setjmp_volatile.c
#include <setjmp.h>
#include <stdio.h>

jmp_buf buf;

int main( int argc, char** argv ) {
    int var = 0;
    volatile int b = 0;
    setjmp( buf );
    if ( b < 3 ) {
        b++;
        var++;
        printf( "\n\n%d\n", var );
        longjmp( buf, 1 );
    }
    return 0;
}

Let's compile it without optimizations (gcc -O0, Listing 14-16) and with optimizations (gcc -O2, Listing 14-17). Without optimizations,


Listing 14-16. volatile_setjmp_o0.asm
main:
    push rbp
    mov  rbp, rsp
    sub  rsp, 0x20
; `argc` and `argv` are saved on the stack so that `rdi` and `rsi` become available
    mov  DWORD PTR [rbp-0x14], edi
    mov  QWORD PTR [rbp-0x20], rsi
; var = 0
    mov  DWORD PTR [rbp-0x4], 0x0
; b = 0
    mov  DWORD PTR [rbp-0x8], 0x0
; 0x600a40 is the address of `buf` (a global variable of type `jmp_buf`)
    mov  edi, 0x600a40
    call 400470
; if (b < 3) { ... } is encoded by jumping over
; several instructions to `.endlabel` when b > 2
    mov  eax, DWORD PTR [rbp-0x8]
    cmp  eax, 0x2
    jg   .endlabel
; an honest increment: b++
    mov  eax, DWORD PTR [rbp-0x8]
    add  eax, 0x1
    mov  DWORD PTR [rbp-0x8], eax
; var++
    add  DWORD PTR [rbp-0x4], 0x1
; the `printf` call
    mov  eax, DWORD PTR [rbp-0x4]
    mov  esi, eax
    mov  edi, 0x400684
; there are no floating-point arguments, so rax = 0
    mov  eax, 0x0
    call 400450
; the `longjmp` call
    mov  esi, 0x1
    mov  edi, 0x600a40
    call 400490
.endlabel:
    mov  eax, 0x0
    leave
    ret


The output of the program will be

1
2
3

With optimizations,

Listing 14-17. volatile_setjmp_o2.asm
main:
; allocating memory on the stack
    sub  rsp, 0x18
; the argument of `setjmp`: the address of `buf`
    mov  edi, 0x600a40
; b = 0
    mov  DWORD PTR [rsp+0xc], 0x0
; the instructions are placed in an order different from that of the C
; statements, to make better use of pipelining and other internal
; CPU mechanisms
    call 400470
; `b` is honestly read from memory and checked
    mov  eax, DWORD PTR [rsp+0xc]
    cmp  eax, 0x2
    jle  .branch
; return 0
    xor  eax, eax
    add  rsp, 0x18
    ret
.branch:
    mov  eax, DWORD PTR [rsp+0xc]
; the second argument of `printf` is var + 1 = 1; `var` is neither read
; from memory nor written: the computation was done at compile time
    mov  esi, 0x1
; the first argument of `printf`
    mov  edi, 0x400674
; b = b + 1
    add  eax, 0x1
    mov  DWORD PTR [rsp+0xc], eax


    xor  eax, eax
    call 400450
; longjmp( buf, 1 )
    mov  esi, 0x1
    mov  edi, 0x600a40
    call 400490

The output of the program will be

1
1
1

The volatile variable b, as you can see, behaved as expected (otherwise the loop would never have ended). The variable var stayed equal to 1 all the time, despite being "incremented" according to the program text.

■■Question 265 How do you implement try-catch constructs using setjmp and longjmp?

14.4 inline  inline is a function qualifier introduced in C99. It mimics the behavior of its C++ counterpart. Before reading the explanation, be aware that this keyword is not used to force function inlining! Prior to C99 there was the static qualifier, often used in the following scenario:
• A header file includes not just the function declaration but the entire function definition, marked static.
• The header is then included in multiple translation units. Each of them receives a copy of the generated code, but since the corresponding symbol is local to the object file, the linker does not see it as a multiple-definition conflict.
In a large project, this gives the compiler access to the function's source code, allowing it to actually inline the function if appropriate. Of course, the compiler may also decide that it is better to leave the function uninlined. In that case, we end up with clones of this function pretty much everywhere. Each file calls its own copy, which is bad for locality and bloats both the memory image and the executable itself. The inline keyword solves this problem. Its correct usage is as follows:
• Define the function as inline in the relevant header, for example,

inline int inc( int x ) { return x+1; }

• In exactly one translation unit (that is, one .c file), add the external declaration

extern inline int inc( int x );


This file will then contain the function's code, which will be referenced from all other files in which the function was not inlined.
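Putting the two pieces into one translation unit for illustration (the file split is sketched in the comments; the function is the inc example above, compiled under C99 inline semantics):

```c
/* In a header (say, inc.h), visible to every translation unit: */
inline int inc( int x ) { return x+1; }

/* In exactly one .c file: forces this unit to emit the external
   definition that non-inlined call sites will link against. */
extern inline int inc( int x );
```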

■■Semantic change In GCC prior to 4.2.1, the keyword inline had a slightly different meaning. See publication [14] for an in-depth discussion.

14.5 restrict  restrict is a keyword similar to volatile and const that first appeared in the C99 standard. It is used to mark pointers and is thus placed to the right of the asterisk, as follows:

int x;
int* restrict p_x = &x;

By creating a restrict-qualified pointer to an object, we promise that every access to this object will go through the value of this pointer. A compiler can either ignore this or use it for certain optimizations, which is usually possible. In other words, no write through another pointer will affect the value pointed to by the restrict-qualified pointer. Breaking this promise leads to subtle bugs and is a clear case of undefined behavior. Without restrict, every pointer is a source of possible memory aliasing, when the same memory cells can be accessed through different names. Consider a very simple example, shown in Listing 14-18. Is the effect of f the same as that of *x += 2 * (*add);?

Listing 14-18. restrict_motiv.c
void f( int* x, int* add ) {
    *x += *add;
    *x += *add;
}

The answer is, surprisingly, no. What if add and x point to the same address? In this case, changing *x changes *add as well. So, if x == add, the function adds *x to *x, doubling the initial value, and then repeats it, making it four times the initial value. However, when x != add, even if *x == *add, the final *x is three times the initial value. The compiler knows this very well, and even with optimizations turned on it will not optimize away the second read, as shown in Listing 14-19.

Listing 14-19. restrict_motiv_dump.asm
0000000000000000 <f>:
   0: 8b 06    mov    eax,DWORD PTR [rsi]
   2: 03 07    add    eax,DWORD PTR [rdi]
   4: 89 07    mov    DWORD PTR [rdi],eax
   6: 03 06    add    eax,DWORD PTR [rsi]
   8: 89 07    mov    DWORD PTR [rdi],eax
   a: c3       ret


However, add restrict, as shown in Listing 14-20, and the disassembly demonstrates an improvement, as shown in Listing 14-21. The second argument is read exactly once, doubled, and added to the dereferenced first argument.

Listing 14-20. restrict_motiv1.c
void f( int* restrict x, int* restrict add ) {
    *x += *add;
    *x += *add;
}

Listing 14-21. restrict_motiv_dump1.asm
0000000000000000 <f>:
   0: 8b 06    mov    eax,DWORD PTR [rsi]
   2: 01 c0    add    eax,eax
   4: 01 07    add    DWORD PTR [rdi],eax
   6: c3       ret

Use restrict only if you are sure of what you are doing. Writing a slightly less efficient program is much better than writing an incorrect one. restrict is also important for documentation purposes. For example, the signature of memcpy, a function that copies n bytes from a block starting at address s2 into a block starting at s1, changed in C99:

void* memcpy( void* restrict s1, const void* restrict s2, size_t n );

This reflects the fact that the two regions should not overlap; otherwise, correctness is not guaranteed. Restrict-qualified pointers can be copied from one to another to create a hierarchy of pointers. However, the standard limits this to the cases when the copy does not live in the same block as the original pointer. Listing 14-22 shows an example.

Listing 14-22. restrict_hierarchy.c
struct s { int* x; } inst;

void f(void) {
    struct s* restrict p_s = &inst;
    int* restrict p_x = p_s->x;      /* Incorrect */
    {
        int* restrict p_x2 = p_s->x; /* OK, another block scope */
    }
}


14.6 Strict Aliasing  Before restrict was introduced, programmers sometimes achieved the same effect by using different structure tags. The compiler assumes that different data types imply that the respective pointers cannot point to the same data (which is known as the strict aliasing rule). The assumptions are as follows:
• Pointers to different built-in types do not alias.
• Pointers to structures or unions with different tags do not alias (so struct foo and struct bar never point to the same data).
• Type aliases created with typedef can refer to the same data.
• The char* type (signed or unsigned) is an exception. The compiler always assumes that char* can alias other types, but not vice versa. This means we can create a char buffer, use it to receive data, and then reinterpret it as an instance of some packed structure.
Breaking these rules can lead to subtle optimization errors, since it triggers undefined behavior. The example shown in Listing 14-18 can be rewritten to achieve the same effect without the restrict keyword. The idea is to use the strict aliasing rules to our advantage by wrapping both parameters in structures with different tags. Listing 14-23 shows the modified source.

Listing 14-23. restrict_hack.c
struct a { int v; };
struct b { int v; };

void f( struct a* x, struct b* add ) {
    x->v += add->v;
    x->v += add->v;
}

To our delight, the compiler optimizes the reads exactly as we wanted. Listing 14-24 shows the disassembly.

Listing 14-24. restrict_hack_dump
0000000000000000 <f>:
   0: 8b 06    mov    eax,DWORD PTR [rsi]
   2: 01 c0    add    eax,eax
   4: 01 07    add    DWORD PTR [rdi],eax
   6: c3       ret

We discourage using the aliasing rules for optimization purposes in C99 code and later, because restrict makes the intent more obvious and does not introduce unnecessary type names.
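The char* exception can be sketched like this (our example): inspecting any object's bytes through unsigned char* is always allowed, whereas reinterpreting a char buffer through an arbitrary pointer type in the other direction is what the rule forbids.

```c
/* Legal: char* (signed or unsigned) may alias any object type,
   so byte-wise inspection of any object is well defined. */
unsigned char first_byte(const void *obj) {
    return ((const unsigned char*)obj)[0];
}
```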


14.7 Security Issues  C was not designed as a language for building robust software. It lets you work with memory directly and has no means of controlling correctness, neither static, like Rust, nor dynamic, like Java. Let's review some classic security flaws, which we can now explain in detail.

14.7.1 Stack Buffer Overrun  Suppose a program uses a function f with a local buffer, as shown in Listing 14-25.

Listing 14-25. buffer_overrun.c
#include <stdio.h>

void f( void ) {
    char buffer[16];
    gets( buffer );
}

int main( int argc, char** argv ) {
    f();
    return 0;
}

After initialization, the stack frame contains, from higher to lower addresses, the return address, the saved rbp value, and the 16-byte buffer. [Stack layout figure omitted.]

The gets function reads a line from stdin and places it into the buffer whose address it accepts as an argument. Unfortunately, it has no means of knowing the buffer size and can therefore exceed it. If the line is too long, it will overwrite the buffer, the saved rbp value, and the return address. When the ret instruction executes, the program will most probably crash. Even worse, by forming a clever input line, an attacker can overwrite the return address with specific bytes that make up a valid address.


If the attacker chooses to redirect the return address right into the overflowed buffer, he can feed executable code into that very buffer. Such code is often referred to as shellcode, because it is small and usually just opens a remote shell to work with. Obviously, this is not only a bug of gets but a feature of the language itself. The moral is to never use gets and to always provide a means of controlling the bounds of the target memory block.
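Following that moral, a bounded replacement sketch (ours) uses fgets, which takes the buffer capacity explicitly:

```c
#include <stdio.h>

/* Reads one line from `in` into `buf`. fgets writes at most cap-1
   characters plus a terminating '\0', so it cannot overrun the buffer. */
char *read_line(char *buf, int cap, FILE *in) {
    return fgets(buf, cap, in);
}
```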

14.7.2 return-to-libc  As we have already explained, a malicious user can rewrite the return address if the program lets him overflow a stack buffer. The return-to-libc attack is performed when the rewritten return address is the address of a function in the standard C library. One function is of particular interest: int system( const char* command ). It allows one to execute an arbitrary shell command; worse, the command will run with the same privileges as the attacked program. When the current function finishes and executes ret, the libc function starts executing. How to form a valid argument for it is a separate question. In the presence of ASLR (address space layout randomization), performing this attack is not trivial (but still possible).

14.7.3 Format String Vulnerabilities  Format output functions can be a source of nasty bugs. There are several such functions in the standard library; Table 14-1 shows them.

Table 14-1. Format string functions

Function    Description
printf      Prints a formatted string to stdout.
fprintf     Prints a formatted string to a file.
sprintf     Prints a formatted string to a string.
snprintf    Prints a formatted string to a string, checking its length.
vfprintf    Prints a va_arg structure to a file.
vprintf     Prints a va_arg structure to stdout.
vsprintf    Prints a va_arg structure to a string.
vsnprintf   Prints a va_arg structure to a string, checking its length.

Listing 14-26 shows an example. Suppose the user inputs fewer than 100 symbols. Can this program crash or produce other interesting effects?

Listing 14-26. printf_vuln.c
#include <stdio.h>

int main(void) {
    char buffer[1024];
    gets( buffer );
    printf( buffer );
    return 0;
}


The vulnerability comes not from using gets but from using a user-supplied string as the format string. The user can provide a string containing format specifiers, which leads to interesting behavior. We will mention several kinds of potentially unwanted behavior.
• The "%x" and similar specifiers can be used to view the contents of the stack. The first five "%x" occurrences will take their arguments from the registers (rdi is already occupied by the format string address); the following ones will read the contents of the stack. Let's compile the example shown in Listing 14-26 and see how it reacts to the input "%x %x %x %x %x %x %x %x %x".

> %x %x %x %x %x %x %x %x %x
b1b6701d b19467b0 fbad2088 b1b6701e 0 25207825 20782520 78252078 25207825

As we can see, it actually gave us four numbers sharing some informal similarity, one 0, and several more numbers. Our guess is that the last numbers were read from the stack. By launching gdb and examining the memory near the top of the stack right after the call to printf, we get results that prove our point. Listing 14-27 shows the output.

Listing 14-27. gdb_printf
(gdb) x/10x $rsp
0x7fffffffdfe0: 0x25207825  0x78252078  0x20782520  0x25207825
0x7fffffffdff0: 0x78252078  0x20782520  0x25207825  0x00000078
...

• The "%s" specifier makes printf interpret the corresponding argument as a pointer to a character string. As a string is defined by the address of its start, this means addressing memory through a pointer. Therefore, if a valid pointer was not supplied, an invalid pointer will be dereferenced.

■■Question 266  What will be the result of running the code shown in Listing 14-26 on the input "%s %s %s %s %s"?

• The "%n" format specifier is a bit exotic but still harmful. It allows you to write an integer into memory. The printf function accepts a pointer to an integer, which gets rewritten with the number of symbols written so far (before the occurrence of "%n"). Listing 14-28 shows an example of its usage.

Listing 14-28. printf_n.c
#include <stdio.h>

int main(void) {
    int count;
    printf( "hello%n world\n", &count );
    printf( "%d\n", count );
    return 0;
}


This will output 5, because five symbols were written before "%n". This is not simply a substring length, because there may be other format specifiers before "%n", which result in output of varying length (for example, printing an integer may yield seven or ten symbols). Listing 14-29 shows an example.

Listing 14-29. printf_n_ex.c
int x;
printf( "%d %n", 10, &x );  /* x = 3 */
printf( "%d %n", 200, &x ); /* x = 4 */

To avoid all that, do not use a user-provided string as a format string. You can always write printf("%s", buffer), which is safe as long as buffer is a valid null-terminated string. Do not forget about functions such as puts and fputs, which are not only faster but also safer.

14.8 Protection Mechanisms Rewriting a return address can have one of the following two consequences: • The program ends abnormally. • The attacker executes arbitrary code. In the first case, we can be victims of a DoS (Denial of Service) attack, when the program, which provides a specific service, is no longer available. However, the second option is much worse.

14.8.1 Security Cookie  The security cookie (also called stack guard or canary) protects us from arbitrary code execution by forcing the program to terminate abnormally as soon as the return address is changed. The security cookie is a random value placed in the stack frame next to the saved rbp value and the return address.


Overflowing the buffer rewrites the security cookie. Before the ret instruction, the compiler emits a special check that verifies the integrity of the security cookie; if it was changed, the program is terminated, and the ret instruction is not executed. Both MSVC and GCC have this mechanism enabled by default.

14.8.2 Address Space Layout Randomization  Loading each program part at a random location in the address space makes it almost impossible to guess a correct return address for performing a clever jump. Most widely used operating systems support it; however, this feature should be enabled at compile time. In that case, information about ASLR support is stored in the executable file itself, forcing the loader to perform the appropriate relocation.

14.8.3 DEP We have already discussed data execution prevention in Chapter 4. This technology protects some pages from the execution of instructions stored in those pages. To enable it, programs must also be compiled with support enabled. The sad reality is that it doesn't work well with programs that use just-in-time compilation, which forms executable code during program execution. This isn't as rare as it sounds; for example, virtually all browsers use JavaScript engines that support just-in-time compilation.

14.9 Summary  In this chapter, we reviewed the calling convention used in *nix on Intel 64. We saw usage examples for the more advanced C features, namely, the restrict and volatile type qualifiers and nonlocal jumps. Finally, we provided a brief overview of several classic vulnerabilities that are possible because of the way stack frames are organized, and of the compiler features designed to counter them automatically. The next chapter will explain more low-level details related to creating and using dynamic libraries, to strengthen our understanding of them.

■■Question 267  What are the xmm registers? How many of them are there?
■■Question 268  What are SIMD instructions?
■■Question 269  Why do some SSE instructions require memory operands to be aligned?
■■Question 270  Which registers are used to pass arguments to functions?
■■Question 271  Why is rax sometimes used when passing arguments to a function?
■■Question 272  How is the rbp register used?
■■Question 273  What is a stack frame?
■■Question 274  Why don't we address local variables relative to rsp?
■■Question 275  What are the prologue and the epilogue?
■■Question 276  What is the purpose of the enter and leave instructions?

Chapter 14 ■ Translation Details

■■Question 277  Describe in detail how the stack frame changes during function execution.
■■Question 278  What is the red zone?
■■Question 279  How do we declare and use a function with a variable number of arguments?
■■Question 280  What kind of information does va_list hold?
■■Question 281  Why are functions such as vfprintf used?
■■Question 282  What is the purpose of volatile variables?
■■Question 283  Why do only volatile stack variables persist after longjmp?
■■Question 284  Are all local variables allocated on the stack?
■■Question 285  What is setjmp for?
■■Question 286  What is the return value of setjmp?
■■Question 287  What is the use of restrict?
■■Question 288  Can the compiler ignore restrict?
■■Question 289  How can we achieve the same result without using the restrict keyword?
■■Question 290  Explain the mechanism of the stack buffer overrun exploit.
■■Question 291  When is using printf dangerous?
■■Question 292  What is a security cookie? Does it prevent the program from crashing on buffer overrun?


CHAPTER 15

Shared Objects and Code Models  Chapter 5 already provided a brief overview of dynamic libraries (also known as shared objects). This chapter will review dynamic libraries and expand our understanding of them by introducing the concepts of the Procedure Linkage Table and the Global Offset Table. As a result, we will be able to build a shared library in pure assembly and in C, compare the results, and study its structure. We will also study the concept of code models, which is rarely spoken about but provides a coherent look at several important details of assembly code generation.

15.1 Dynamic Loading  As you remember, an ELF (Executable and Linkable Format) file contains three headers:
• The main header, located at offset zero. It defines general information about the file, including the entry point and the offsets of the two tables elaborated below. You can view it using the readelf -h command.
• The section header table, which contains information about the different ELF sections. You can view it using the readelf -S command.
• The program header table, which contains information about the file's segments. Each segment is a runtime structure containing one or more sections defined in the section header table. You can view it using the readelf -l command.
The initial stage of loading an executable consists of creating an address space and performing memory mappings according to the program header table, with the appropriate permissions. This is done by the operating system kernel. Once the virtual address space is set up, another program must intervene, namely, the dynamic loader. The latter should be an executable program, and fully relocatable at that (so it can be loaded at any address we want). The purpose of the dynamic loader is to
• Determine all dependencies and load them.
• Perform the relocation of the application and its dependencies.
• Initialize the application and its dependencies and pass control to the application.
Then the program execution starts.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_15


Chapter 15 ■ Shared Objects and Code Models

Determining dependencies and loading them is relatively easy: it boils down to traversing dependencies recursively and checking whether each object has already been loaded. Loading itself is not too complex either. Relocation, however, is what interests us. There are two types of relocations:
• References to locations within the same object. The static linker performs all these relocations, since they are known at link time.
• Symbol dependencies, which usually refer to a different object.
The second type of relocation is more expensive and is performed by the dynamic linker. Before performing such relocations, we must first carry out a lookup to find the symbols we want to link against. There is a notion of the lookup scope of an object file, which is an ordered list of some other loaded objects. The lookup scope of an object file is used to resolve the symbols it requires. The way it is computed is described in [24] and is quite complex, so we refer you to that document should the need arise. The lookup scope consists of three parts, which are listed in the reverse order of the search; that is, the symbol is searched for in the third part of the scope first.
1. The global lookup scope, consisting of the executable file and all its dependencies, including dependencies of dependencies, etc. They are listed in a breadth-first fashion:
• The executable itself.
• Its dependencies.
• The dependencies of its first dependency, then of its second, and so on. Each object is loaded only once.
2. The part constructed if the DF_SYMBOLIC flag is set in the metadata of the ELF executable file. It is considered legacy and its use is discouraged, so we will not study it here.
3. Objects loaded dynamically, with all their dependencies, by calling the dlopen function. They are not searched during normal lookups.
Each object file contains a hash table that is used for lookups.1 This table stores symbol information and is used to quickly find a symbol by name.
The first object in the lookup scope that contains the required symbol is linked, which allows symbol interposition, for example, via the LD_PRELOAD mechanism, which will be explored in Section 15.5. The hash table size and the number of exported symbols affect lookup time. When the -O flag is given to the linker,2 it tries to optimize these parameters to improve lookup speed. Remember that in languages such as C++, symbol names are not computed from, say, the function name alone, but encode all enclosing namespaces (and the class name), so they can easily be several hundred characters long. In the case of collisions in the hash table (which tend to be frequent), a string comparison must be made between the name of the symbol we are looking for and every symbol in the bucket selected by its hash. Modern GNU-style hash tables add the heuristic of a Bloom filter3 to quickly answer the question: "Is this symbol defined in this object file?" This makes unnecessary lookups much less frequent, which has a positive impact on performance.

1. We won't go into detail about what hash tables are or how they are implemented, but if you are not familiar with them, we recommend reading about them: this is an absolutely classic data structure used everywhere. You can find a good explanation in [10].
2. Don't confuse this with the -O flag of the compiler!
3. A widely used probabilistic data structure. It allows us to quickly check whether an element is contained in a given set, but the answer "yes" is subject to further verification, while "no" is always final.


15.2 Relocations and PIC

Now, what kinds of relocations are performed? We saw the relocation process during static linking in Chapter 5. Can we do the same here, relocating all code and data references? The answer is yes, we can, and until common architectures added special features to ease writing position-independent code (PIC), this approach was widely used. However, it has the following drawbacks:
• Relocations are slow, especially when dependencies are large. This can delay the start of the application.
• The .text section cannot be shared, because it has to be patched. Whereas static linking patches the contents of object files when producing the final object file, this kind of dynamic linking patches the object files in memory. Not only does this waste memory, it also poses a security risk: for example, shellcode could rewrite the program directly in memory to change its behavior.
Nowadays, PIC is the recommended way, and it allows keeping .text read-only (whereas .data cannot be shared anyway). The number of relocations is also smaller, as there are no code relocations. PIC involves the use of two utility tables: the Global Offset Table (GOT) and the Procedure Linkage Table (PLT).

15.3 Example: Dynamic Library in C

Before we start studying the GOT and the PLT, let's create a minimal working example of a dynamic library in C. It is actually quite easy. Our program consists of two files: mainlib.c (shown in Listing 15-1) and dynlib.c (shown in Listing 15-2).

Listing 15-1. mainlib.c
void libfun(int value);
int global = 100;

int main(void) {
    libfun(42);
    return 0;
}

Listing 15-2. dynlib.c
#include <stdio.h>
extern int global;

void libfun(int value) {
    printf( "param: %d\n", value );
    printf( "global: %d\n", global );
}

As you can see, there is a global variable in the main file, which we want to share with the library; the library explicitly declares it as external. The main file contains the library function declaration (which would usually be placed in a header file supplied with the compiled library).


To compile these files, execute the following commands:

# create an object file for the main part
> gcc -c -o mainlib.o mainlib.c
# create an object file for the library
> gcc -c -fPIC -o dynlib.o dynlib.c
# create the dynamic library itself
> gcc -o dynlib.so -shared dynlib.o
# create an executable and link it with the dynamic library
> gcc -o main mainlib.o dynlib.so

First, we create the object files as usual. We then build the dynamic library using the -shared flag. When we build the executable, we provide all the dynamic libraries it depends on, as this information must be included in the ELF metadata. Note the use of the -fPIC flag, which forces position-independent code generation. Later we will see the effects of this flag on the assembly. Let's check the file dependencies using ldd.

> ldd main
    linux-vdso.so.1 => (0x00007fffcd428000)
    dynlib.so => not found
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff988d60000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff989200000)

Our new library is present in the dependency list, but ldd cannot find it. An attempt to launch the executable fails with the expected message:

./main: error while loading shared libraries: dynlib.so: cannot open shared object file: No such file or directory

Libraries are searched for in default locations (such as /lib). Ours is not there, so we have another option: the environment variable LD_LIBRARY_PATH is parsed to get a list of additional directories where libraries can be located. Note that the search starts with the directories listed in LD_LIBRARY_PATH and continues with the default directories. Once we set it to the current directory, ldd finds the library.

> export LD_LIBRARY_PATH=.
> ldd main
    linux-vdso.so.1 => (0x00007ffff1315000)
    dynlib.so => ./dynlib.so (0x00007f3a7bc70000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3a7b890000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f3a7c000000)

Launching the program produces the expected results.

> ./main
param: 42
global: 100

15.4 GOT and PLT

15.4.1 Accessing External Variables

To keep .text read-only and never patch it during relocation, we add a level of indirection when addressing any symbol that is not guaranteed to be defined in the same object: in other words, for each symbol whose definition is selected among executable and shared object files after static linking. This indirection is implemented through a special global offset table.


Two facts are important for PIC to work.
• Intel 64 makes it possible to address instruction operands relative to the rip register. It is possible to get the current value of rip using a pair of call and pop instructions, but hardware support certainly helps performance.
• The offset between the .text section and the .data section is known at link time, that is, when the dynamic library is created. This also means that the distance between rip and the start of the .data section is known. Therefore, we place the Global Offset Table in or close to the .data section. It contains the absolute addresses of global variables. We address the relevant GOT cell relative to rip and fetch the absolute address of the global variable from it; see Figure 15-1.

Figure 15-1. Accessing the global variable via GOT


Let's see how the global variable, defined in the main executable file, is addressed from the dynamic library. To do this, we'll look at a snippet of the objdump -D -Mintel-mnemonic output, shown in Listing 15-3.

Listing 15-3. libfun disassembly
00000000000006d0 <libfun>:
# Function prologue
 6d0:  55                      push  rbp
 6d1:  48 89 e5                mov   rbp,rsp
 6d4:  48 83 ec 10             sub   rsp,0x10

# Second argument to printf("param: %d\n", value);
 6d8:  89 7d fc                mov   DWORD PTR [rbp-0x4],edi
 6db:  8b 45 fc                mov   eax,DWORD PTR [rbp-0x4]
 6de:  89 c6                   mov   esi,eax

# First argument to printf("param: %d\n", value);
 6e0:  48 8d 3d 32 00 00 00    lea   rdi,[rip+0x32]

# printf call; XMM registers are not used
 6e7:  b8 00 00 00 00          mov   eax,0x0
 6ec:  e8 bf fe ff ff          call  5b0 <printf@plt>

# Second argument to printf("global: %d\n", global);
 6f1:  48 8b 05 e0 08 20 00    mov   rax,QWORD PTR [rip+0x2008e0]
 6f8:  8b 00                   mov   eax,DWORD PTR [rax]
 6fa:  89 c6                   mov   esi,eax

# First argument to printf("global: %d\n", global);
 6fc:  48 8d 3d 21 00 00 00    lea   rdi,[rip+0x21]

# printf call; XMM registers are not used
 703:  b8 00 00 00 00          mov   eax,0x0
 708:  e8 a3 fe ff ff          call  5b0 <printf@plt>

# Function epilogue
 70d:  90                      nop
 70e:  c9                      leave
 70f:  c3                      ret

Remember that the source code is shown in Listing 15-2. We are interested in seeing how global variables are accessed. First, notice that the first argument to printf (the address of the format string, which resides in .rodata) is not accessed in the usual way. In such cases we used to have an absolute address (which would have been filled in by the linker during relocation, as explained in Section 5.3.2). Here, however, a rip-relative address is used. As we know, rdi as the first argument must contain the format string address, so it is computed as rip plus a fixed offset (lea rdi,[rip+0x32]): the string lives in the same object, and the offset between it and the code is fixed at link time.


Now, let's see how global is accessed from the dynamic library code. In fact, the mechanism is exactly the same, although one more memory read is required. First we read the GOT entry with mov rax, QWORD PTR [rip+0x2008e0] to get the address of global; then we read its value by accessing memory again with mov eax, DWORD PTR [rax]. Simple enough for global variables. For functions, however, the implementation is a bit more complicated.

15.4.2 Calling External Functions

While the same approach could have worked for functions, an additional mechanism is used to perform lazy function resolution on demand. Let's first discuss the reasons for it. Looking up symbol definitions is not trivial, as we have seen in this chapter. Typically there are many more functions than exported global variables, and only a small fraction of them are actually called during program execution (e.g., error-handling functions). Moreover, when programmers get a third-party dynamic library to use with their program, it often exports far more functions than they will actually call. So another level of indirection is added through the special Procedure Linkage Table (PLT). It resides in the .text section. Each function called through a shared library has an entry in the PLT. Each entry is a small piece of executable code, which is statically linked and can therefore be called directly. Instead of calling a function whose address would be stored in the GOT, we call its stub entry in the PLT. To illustrate, we sketch a PLT in Listing 15-4.

Listing 15-4. plt_sketch.asm
; somewhere in the program
call function@plt
...
; PLT
PLT_0:              ; the common part
    call resolver
...
PLT_n:              ; function@plt
    jmp [GOT_n]
PLT_n_first:
    ; here we prepare the arguments for the resolver
    jmp PLT_0
...
GOT:
...
GOT_n: dq PLT_n_first


Now, what is going on there?
• A function call refers to a PLT entry without going through the GOT.
• The zeroth PLT entry contains the "common code" shared by all entries. Every entry eventually jumps to it.
• The n-th entry starts with a jump to the address stored in the n-th GOT entry. The default value of that entry is the address of the instruction following this very jump! In our sketch, it is marked by the PLT_n_first label. So, the first time the function is called, we jump to the next instruction, effectively performing a no-op.
• That code prepares arguments for the dynamic loader and jumps to the common code in PLT_0.
• In PLT_0 the loader is called. It performs a lookup, resolves the function address, and fills GOT_n with the real function address.
Subsequent calls to the function will not involve the dynamic loader: the PLT_n stub will be called, which will immediately jump to the resolved function whose address now resides in the GOT. See Figures 15-2 and 15-3 for a schematic of the changes in the PLT caused by the symbol resolution process.

Figure 15-2. PLT before function binding at runtime


Figure 15-3. PLT after function binding at runtime

■■Question 293 Read in man ld.so about environment variables (like LD_BIND_NOT), which can change loader behavior.

15.4.3 PLT Example

To be completely fair, we will study the generated code for the example shown in Section 15.3. The main function calls libfun, which is done via the PLT, as you would expect. Disassembling the .text section:

00000000004006a6 <main>:
    push  rbp
    mov   rbp,rsp
    mov   edi,0x2a
    call  400580 <libfun@plt>
    mov   eax,0x0
    pop   rbp
    ret


Next, let's see what the PLT looks like. The PLT entry for libfun is called libfun@plt; find it in Listing 15-5.

Listing 15-5. plt_rw.asm
Disassembly of section .init:

0000000000400550 <_init>:
...

Disassembly of section .plt:

0000000000400570 <libfun@plt-0x10>:
  400570:  push  QWORD PTR [rip+0x200a92]    # 601008
  400576:  jmp   QWORD PTR [rip+0x200a94]    # 601010
  40057c:  nop   DWORD PTR [rax+0x0]

0000000000400580 <libfun@plt>:
  400580:  jmp   QWORD PTR [rip+0x200a92]    # 601018
  400586:  push  0x0
  40058b:  jmp   400570 <_init+0x20>

The first instruction is a jump to the GOT's third element (each entry is 8 bytes long, and the offset is 0x18). Then a push instruction is issued, whose operand is the function's number in the PLT. For libfun it is 0x0; for __libc_start_main it is 0x1. The next instruction in libfun@plt is a jump to _init+0x20, which looks weird, but if we check the actual address of _init, we see that:
• _init is at 0x400550.
• _init+0x20 is at 0x400570.
• libfun@plt-0x10 is also at 0x400570, so they are the same.


• This address is also the start of the .plt section and, as explained above, contains the "common" code shared by all PLT entries. It pushes one more GOT value onto the stack and fetches from the GOT the address of the dynamic loader to jump to. The comments emitted by objdump show that these two values refer to addresses 0x601008 and 0x601010. As we can see, they are stored in the .got.plt section, which is the part of the GOT related to PLT entries. Listing 15-6 shows the contents of this section.

Listing 15-6. got_plt_dump
Contents of section .got.plt:
 601000 180e6000 00000000 00000000 00000000
 601010 00000000 00000000 86054000 00000000
 601020 96054000 00000000

Looking closer, we see that starting at address 0x601018 the following bytes are located: 86 05 40 00 00 00 00 00. Remembering that Intel 64 is little-endian, we conclude that the quadword stored here is 0x400586, which is exactly the address of libfun@plt + 6, that is, the address of the push 0x0 instruction. This illustrates the fact that the initial GOT values for functions point to the second instructions of their respective PLT entries.

15.5 Preloading

Setting the LD_PRELOAD environment variable allows us to preload shared objects before any other library (including the C standard library). Functions in such a library have lookup priority, so they can override functions defined in the shared objects loaded in the usual way. The dynamic loader ignores the LD_PRELOAD value if the effective user ID and the real user ID do not match; this is done for security reasons. Let's write and compile a simple program, as shown in Listing 15-7.

Listing 15-7. preload_launcher.c
#include <stdio.h>

int main(void) {
    puts("Hello world!");
    return 0;
}

It does nothing spectacular, but it is important that it uses the puts function, defined in the C standard library. We are going to replace it with our own version of puts, which ignores its input and simply outputs a fixed string. When this program is started as is, the standard puts function is executed.


Now let's make a simple dynamic library with the contents shown in Listing 15-8. It replaces the puts function with an alternative, which ignores its argument and always outputs a fixed string.

Listing 15-8. prelib.c
#include <stdio.h>

int puts( const char* str ) {
    return printf("We took over your C library!\n");
}

We compile everything using the following commands:

> gcc -o preload_launcher preload_launcher.c
> gcc -c -fPIC prelib.c
> gcc -o prelib.so -shared prelib.o

Notice that the executable has not been linked against this dynamic library. Listing 15-9 shows the effect of setting the LD_PRELOAD variable.

Listing 15-9. ld_preload_effect
> export LD_PRELOAD=
> ./preload_launcher
Hello world!
> export LD_PRELOAD=$PWD/prelib.so
> ./preload_launcher
We took over your C library!

As we can see, if LD_PRELOAD contains a path to a shared object that defines some functions, they override the functions of the same name present elsewhere in the process address space.

■■Question 294 Use this technique to test your malloc implementation from the homework against some standard coreutils programs.
■■Question 295 Read about the dlopen, dlsym, and dlclose functions.

15.6 Symbol Addressing Summary

Before we proceed to the assembly and C examples, let's summarize the possible cases of symbol addressing. The main executable file is generally not relocatable or position-independent and is loaded at a fixed absolute address, for example, 0x400000.4 Shared objects can be loaded anywhere; in them, relocations may be necessary. A symbol can be:
1. Defined in the executable and used locally there. This case is trivial, because symbols are bound to absolute addresses. Data addressing is absolute; code jumps and calls are normally generated with rip-relative offsets.

4. This is not always the case; for example, OS X recommends that all executables be position-independent.


2. Defined in a dynamic library and used only locally (not available to external objects). With PIC, this is done using rip-relative addressing (for data) or relative offsets (for function calls). The more general case will be discussed later, in Section 15.10. NASM uses the rel keyword to produce rip-relative addressing. This involves neither the GOT nor the PLT.
3. Defined in the executable and used globally. This requires using the GOT (and also the PLT for functions) when the use is external. For internal uses the rules are the same as in the first case: we do not need the GOT or the PLT to address within the same object file.
4. Defined in a dynamic library and used globally. As in the previous case, such accesses go through the GOT (and the PLT for functions).

15.7 Examples It is perfectly possible to write a dynamic library in assembly language, which will be position-independent and will use GOT and PLT tables.

■■Linking with gcc The recommended way to link libraries is through GCC. However, for this chapter we will sometimes use more primitive ld to show what is actually done in more detail. When it comes to the C runtime, never use ld. We're also sticking with Intel 64 as usual. PIC code was a bit more difficult to write before the introduction of rip-relative addressing.

15.7.1 Calling a Function

The first example demonstrates the following features:
• Addressing the dynamic library's data within the same library.
• Calling a dynamic library function from the main executable file.
This example consists of main.asm (Listing 15-10) and lib.asm (Listing 15-11). The Makefile is provided in Listing 15-12 to show the build process. Note that you need to explicitly provide the dynamic linker unless you are using GCC to link the files (in which case it takes care of the proper dynamic linker path). See Section 15.7.2 for further explanation.


Listing 15-10. ex1-main.asm
extern _GLOBAL_OFFSET_TABLE_
global _start
extern sofun

section .text
_start:
    call sofun wrt ..plt

    ; `exit` system call
    mov rdi, 0
    mov rax, 60
    syscall

The first thing we notice is extern _GLOBAL_OFFSET_TABLE_, which is usually imported into each file that is linked dynamically.5 The main file imports the symbol named sofun. The call contains not only the function name but also the wrt ..plt qualifier. Referencing a symbol through wrt ..plt forces the linker to create a PLT entry. The corresponding expression evaluates to the offset of the PLT entry relative to the current position in the code. Before static linking this offset is unknown, but the static linker fills it in. The relocation used here is thus a rip-relative one (as used in the call or jmp instructions): the ELF structure does not provide a means of addressing PLT entries by their absolute addresses.

Listing 15-11. ex1-lib.asm
extern _GLOBAL_OFFSET_TABLE_
global sofun:function

section .rodata
msg: db "SO function called", 10
.end:

section .text
sofun:
    mov rax, 1
    mov rdi, 1
    lea rsi, [rel msg]
    mov rdx, msg.end - msg
    syscall
    ret

Note that the exported symbol sofun is marked as :function (there must be no space before the colon). It is very important to mark exported functions this way so that other objects can access them dynamically. The .end label allows us to statically compute the length of the string passed to the write system call. The other important change is the use of the rel keyword.

5. This name is ELF-specific and must be changed for other systems. See section 9.2.1 of [27].


The code is position-independent, so the absolute address of msg can be arbitrary. Its offset relative to this place in the code (the lea rsi, [rel msg] instruction), however, is fixed. Therefore, we can use lea to compute its address as a rip-relative offset. This line is compiled into lea rsi, [rip + offset], where offset is a constant filled in by the static linker. The latter form ([rip + offset]) is not valid NASM syntax. Listing 15-12 shows the Makefile used to build this example. Before launching the result, make sure the LD_LIBRARY_PATH environment variable includes the current directory; otherwise, just type export LD_LIBRARY_PATH=. for testing purposes and then launch the executable.

Listing 15-12. ex1-Makefile
main: main.o lib.so
	ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 main.o lib.so -o main

lib.so: lib.o
	ld -shared lib.o -o lib.so

lib.o: lib.asm
	nasm -felf64 lib.asm -o lib.o

main.o: main.asm
	nasm -felf64 main.asm -o main.o

■■Question 296 Perform an experiment. Ignore the wrt ..plt construct for the call and recompile everything. Then use objdump -D -Mintel-mnemonic on the resulting main executable to check whether the PLT is still in the game or not. Try to launch it.

15.7.2 About Multiple Dynamic Linkers

The dynamic linker is not fixed. Its path is encoded as part of the metadata in the ELF file and can be viewed using ldd. During linking, you can control which dynamic linker is chosen, for example:

ld --dynamic-linker=/lib64/ld-linux-x86-64.so.2 ...

If you do not specify it, ld will choose a default path, which may point to a non-existent file on your system. If the dynamic linker does not exist, trying to launch the executable results in a cryptic message that makes little sense. Suppose you have created a main executable that uses a so_lib.so library, and LD_LIBRARY_PATH is set correctly.

> ./main
bash: ./main: No such file or directory
> ldd ./main
    linux-vdso.so.1 => (0x00007ffcf7f9f000)
    so_lib.so => ./so_lib.so (0x00007f0e1cc0a000)


The problem is that the linking was done without a proper dynamic linker and the ELF metadata does not contain a correct path to it. Relinking the object files with a proper dynamic linker path resolves this issue. For example, in the Debian Linux distribution installed on the virtual machine accompanying this book, the dynamic linker is /lib64/ld-linux-x86-64.so.2.

15.7.3 Accessing an External Variable

In the following example we make the message string reside in the main executable file; apart from that, the code stays the same. This allows us to show how to access an external variable. The main file is shown in Listing 15-13; the library source is shown in Listing 15-14.

Listing 15-13. ex2-main.asm
extern _GLOBAL_OFFSET_TABLE_
global _start
extern sofun
global msg:data (msg.end - msg)

section .rodata
msg: db "SO function called -- message is stored in 'main'", 10
.end:

section .text
_start:
    call sofun wrt ..plt
    mov rdi, 0
    mov rax, 60
    syscall

Listing 15-14. ex2-lib.asm
extern _GLOBAL_OFFSET_TABLE_
global sofun:function
extern msg

section .text
sofun:
    mov rax, 1
    mov rdi, 1
    mov rsi, [rel msg wrt ..got]
    mov rdx, 50
    syscall
    ret

It is very important to mark the declaration of dynamically shared data with its size. The size is given as an expression, which can include labels and operations on them, such as subtraction. Without the size, the static linker will treat the symbol as global (visible to other modules during the static linking phase), but the dynamic library will not export it.


When the variable is declared as global with its size and type (:data), it resides in the .data section of the executable file instead of the library. Because of this, we always have to access it through the GOT, even within the same file. The GOT, as we know, stores the addresses of the global variables of the process. So, if we want to know the address of msg, we have to read an entry of the GOT. However, since the dynamic library is position-independent, we also have to address its GOT relative to rip. If we then want to read the variable's value, we need an additional memory read after getting its address from the GOT. If a variable is declared in the dynamic library and accessed from the main executable file, this is done with exactly the same construction: its address can be read from [rel varname wrt ..got]. If you need to store the address of a GOT variable, use the following qualifier:

othervar: dq global_var wrt ..sym

For more information, see section 7.9.3 of [27].

15.7.4 Complete Assembly Example

Listing 15-15 and Listing 15-16 show a complete example with all the common features a dynamic library needs.

Listing 15-15. ex3-main.asm
extern _GLOBAL_OFFSET_TABLE_
extern fun1
global commonmsg:data commonmsg.end - commonmsg
global mainfun:function
global _start

section .rodata
commonmsg: db "fun2", 10, 0
.end:
mainfunmsg: db "mainfun", 10, 0

section .text
_start:
    call fun1 wrt ..plt
    mov rax, 60
    mov rdi, 0
    syscall

mainfun:
    mov rax, 1
    mov rdi, 1
    mov rsi, mainfunmsg
    mov rdx, 8
    syscall
    ret


Listing 15-16. ex3-lib.asm
extern _GLOBAL_OFFSET_TABLE_
extern commonmsg
extern mainfun
global fun1:function

section .rodata
msg: db "fun1", 10

section .text
fun1:
    mov rax, 1
    mov rdi, 1
    lea rsi, [rel msg]
    mov rdx, 5
    syscall
    call fun2
    call mainfun wrt ..plt
    ret

fun2:
    mov rax, 1
    mov rdi, 1
    mov rsi, [rel commonmsg wrt ..got]
    mov rdx, 5
    syscall
    ret

15.7.5 Mixing C and Assembly

Disclaimer: we provide a compiler- and architecture-specific example, so the process may vary in your case. However, the core ideas remain more or less the same. What can complicate mixing C and assembly code is that you have to take the C standard library into account and link everything correctly. The easiest way is to create the object files separately with GCC and NASM, respectively, and then link them with GCC as well. Other than that, there is not much to be afraid of. Listing 15-17 and Listing 15-18 show an example of calling an assembly library from C.

Listing 15-17. ex4-main.c
#include <stdio.h>

extern int sofun( void );
extern const char sostr[];

int main( void ) {
    printf( "%d\n", sofun() );
    puts( sostr );
    return 0;
}


In the main file, an external function sofun from the dynamic library is called. Its result is printed to stdout by printf. Then the string taken from the dynamic library is output by puts. Note that sostr is a global character buffer, not a pointer!

Listing 15-18. ex4-lib.asm
extern _GLOBAL_OFFSET_TABLE_
extern puts
global sostr:data (sostr.end - sostr)
global sofun:function

section .rodata
sostr: db "sostring", 10, 0
.end:
localstr: db "localstr", 10, 0

section .text
sofun:
    lea rdi, [rel localstr]
    call puts wrt ..plt
    mov rax, 42
    ret

In the library, sofun is defined, as is the global string sostr. sofun calls puts, a C standard library function, with the address of localstr as an argument. Since the library is written to be position-independent, this address must be computed as an offset from rip; hence the lea instruction. The function always returns 42. Listing 15-19 shows the relevant Makefile.

Listing 15-19. ex4-Makefile
all: main

main: main.o lib.so
	gcc -o main main.o lib.so

lib.so: lib.o
	gcc -shared lib.o -o lib.so

lib.o: lib.asm
	nasm -felf64 lib.asm -o lib.o

main.o: main.c
	gcc -ansi -c main.c -o main.o

clean:
	rm -rf *.o *.so main


15.8 Which Objects Are Linked?

The standard C library is usually implemented as one or more static libraries (defining _start, for example) and a dynamic library containing the functions we are used to calling. The structure of the library is strictly architecture-dependent, but we are going to run several experiments to investigate it. Documentation relevant to our specific case can be found in [3]. How do we find out which libraries GCC links the executable against? We can run an experiment by invoking GCC with the -v argument. The following is the list of additional arguments that GCC implicitly passes during the final linking step according to the Makefile shown in Listing 15-19:

/usr/lib/gcc/x86_64-linux-gnu/4.9/collect2
-plugin /usr/lib/gcc/x86_64-linux-gnu/4.9/liblto_plugin.so
-plugin-opt=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
-plugin-opt=-fresolution=/tmp/ccqEOGnU.res
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
-plugin-opt=-pass-through=-lc
-plugin-opt=-pass-through=-lgcc
-plugin-opt=-pass-through=-lgcc_s
--sysroot=/ --build-id --eh-frame-hdr -m elf_x86_64 --hash-style=gnu
-dynamic-linker /lib64/ld-linux-x86-64.so.2
-o main
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crt1.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crti.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/crtbegin.o
-L/usr/lib/gcc/x86_64-linux-gnu/4.9
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../../lib
-L/lib/x86_64-linux-gnu -L/lib/../lib
-L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib
-L/usr/lib/gcc/x86_64-linux-gnu/4.9/../../..
main.o lib.so
-lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed
/usr/lib/gcc/x86_64-linux-gnu/4.9/crtend.o
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/crtn.o


The abbreviation lto stands for "link-time optimization," which we are not interested in here. The interesting part consists of the additionally linked object files:

• crt1.o
• crti.o
• crtbegin.o
• crtend.o
• crtn.o

ELF files support multiple sections, as we know. A separate .init section stores code that is executed before main; another section, .fini, stores code that is called when the program ends. The content of these sections is spread over several files. crti.o and crtn.o contain the prologue and epilogue of the _init function (and likewise for the _fini function). These two functions are called before and after program execution, respectively. crtbegin.o and crtend.o contain other utility code included in the .init and .fini sections; they are not always present. We want to emphasize that the order of these files is important. crt1.o contains the _start function. To check our claims, let's disassemble the files crti.o, crtn.o, and crt1.o using good old objdump -D -Mintel-mnemonic. Listings 15-20, 15-21, and 15-22 show the refined disassembly.

Listing 15-20. da_crti

    /usr/lib/x86_64-linux-gnu/crti.o:   file format elf64-x86-64

    Disassembly of section .init:
    0000000000000000 <_init>:
       0: sub  rsp,0x8
       4: mov  rax,QWORD PTR [rip+0x0]   # b
       b: test rax,rax
       e: je   15
      10: call 15

    Disassembly of section .fini:
    0000000000000000 <_fini>:
       0: sub  rsp,0x8

Listing 15-21. da_crtn

    /usr/lib/x86_64-linux-gnu/crtn.o:   file format elf64-x86-64

    Disassembly of section .init:
    0000000000000000 <.init>:
       0: add  rsp,0x8
       4: ret

    Disassembly of section .fini:


0000000000000000 : 0: addrsp,0x8 4: ret Listing 15-22. da_crt1 /usr/lib/x86_64-linux-gnu/crt1.o:elf64-x86-64 file format Stripping .text section: 00000000000000000 : 0:xorebp,ebp 2:movr9,rdx 5:poprsi 6:movrdx, rsp 9 : andrsp,0xffffffffffff0 d:pushrax e:pushrsp f:movr8,0x0 16:movrcx,0x0 1d:movrdi,0x0 24:call29 29:hlt As we can see, these form functions end up in the executable. To see the entire code linked and relocated, we'll take a portion of the objdump -D -Mintel-mnemonic output for the resulting file, as shown in Listing 15-23. Listing 15-23. Dasm_init_fini disassembly .init: 00000000004005d8: 4005d8: Subrsp, 0x8 4005DC: MOVRAX, QWORD PTR [RIP+0X200A15]# 600FEF8 4005E3: TESTRAX, RAX 4005E6: JE4005ED 4005E8 Call 400650 5F1: TESTRAX, RAX 4005E6: JE4005ED 4005E8: CALL 400650 5F1: TESTRAX, RAX 4005E6: JE4005ED 4005E8: CALL 400650 5F1005: TESTRAX, RAX 4005E6: JE4005ED 4005E8: CALL 400650 5F10055T. text: 0000000000400660 : 400660:xorebp,ebp 400662:movr9,rdx 400665:poprsi 400666:movrdx,rsp 400669:andrsp,0xffffffffffff0 40066d:pushrax 40066e:pus080sp0:40


      400676: mov  rcx,0x400790
      40067d: mov  rdi,0x400756
      400684: call 400640
      400689: hlt

    Disassembly of section .fini:
    0000000000400804 <_fini>:
      400804: sub  rsp,0x8
      400808: add  rsp,0x8
      40080c: ret

15.9 Optimizations

What affects performance when working with a dynamic library? First, never forget the -fPIC compiler option.6 Without it, even the .text section will need relocations, making dynamic libraries much less attractive to use. It is also important to be aware of factors that can prevent dynamic libraries from performing well. As we have seen, when a function is declared static in the dynamic library and is therefore not exported, it can be called directly, without the PLT overhead. Always use static to limit visibility to a single file where possible. It is also possible to control symbol visibility in a compiler-dependent manner. For example, GCC distinguishes four types of visibility (default, hidden, internal, protected), of which only the first two interest us. The visibility of all symbols at once can be controlled using the -fvisibility compiler option, as follows:

> gcc -fvisibility=hidden ...   # will hide all symbols of the shared object

The "default" visibility level implies that all non-static symbols are visible from outside the shared object. Using the __attribute__ directive, we can control visibility precisely, symbol by symbol. Listing 15-24 shows an example.

Listing 15-24. symbol_visibility.c

    int __attribute__ ((visibility("default"))) func(int x) { return 42; }

A good practice is to hide all symbols of a shared object and explicitly mark the exported ones with default visibility. This way you describe the interface in full. It is especially nice because no other symbols are exposed, so you can change the library internals without breaking binary compatibility of any kind. Data relocations can slow things down a bit: whenever a variable in .data stores the address of another variable, it has to be initialized by the dynamic linker once the absolute address of the latter is known. Avoid such situations when possible.
Since accessing local symbols bypasses the PLT, you may want to reference only "hidden" functions in your code and make publicly available wrappers for the functions you want to export. Only calls to the wrappers will then go through the PLT. Listing 15-25 shows an example.

6. The similar -fpic option imposes a limit on the GOT size on some architectures but generally produces faster code.


Listing 15-25. so_adapter.c

    #include <stdio.h>

    static int _function(int x) { return x + 1; }

    void another_function(void) { printf("%d\n", _function(41)); }

    int function(int x) { return _function(x); }

To eliminate the possible overhead of wrapper functions, there is a technique of creating symbol aliases (which is also compiler specific). GCC implements it through the alias attribute. Listing 15-26 shows an example.

Listing 15-26. gcc_alias.c

    #include <stdio.h>

    int global = 42;
    extern int global_alias __attribute__ ((alias("global"), visibility("hidden")));

    void fun(void) { puts("1337\n"); }
    extern void fun_alias(void) __attribute__ ((alias("fun"), visibility("hidden")));

    int tester(void) {
        printf("%d\n", global);
        printf("%d\n", global_alias);
        fun();
        fun_alias();
        return 0;
    }

When we compile it with gcc -shared -O3 -fPIC and disassemble the result, we see the code shown in Listing 15-27 (the disassembly of the tester function).

Listing 15-27. gcc_aliased_gain.asm

    ; global -> esi
     787: mov rax,QWORD PTR [rip+0x20084a]   # 200fd8
     78e: mov eax,DWORD PTR [rax]
     790: mov esi,eax
     792: lea rdi,[rip+0x46]                 # 7df
     799: mov eax,0x0
     79e: call 650


    ; global_alias -> esi
     7a3: mov eax,DWORD PTR [rip+0x20088f]   # 201038
     7a9: mov esi,eax
     7ab: lea rdi,[rip+0x2d]                 # 7df
     7b2: mov eax,0x0
     7b7: call 650

    ; calling the global `fun`
     7bc: call 640

    ; calling the aliased `fun` directly
     7c1: call 770

global and global_alias are treated differently: the latter requires one less memory read. The call of the aliased fun is also handled more efficiently, bypassing the PLT and thus saving one extra jump. Finally, remember that zero-initialized global variables are always faster to initialize. However, we strongly recommend against using global variables at all. More information about shared object optimizations can be found in [13].

■■Note  The common way to link against libraries is to use the -l key, for example, gcc -lhello. The only two differences from specifying the full file path are:

• -lhello will look for a library called libhello.a (the name is prefixed with lib and suffixed with the .a extension).

• The library is searched for in a standard list of directories. Custom directories are also searched; they can be provided using the -L option. For example, to include the directory /usr/libcustom and the current directory in the search, you could type:

> gcc -lhello -L. -L/usr/libcustom main.c

Remember that the order in which you provide the libraries is important.

15.10 Code Models

Code models are a rarely discussed topic. [24] can be seen as the reference on this subject, and we will discuss code models in this section. The starting point of the discussion is the fact that rip-relative addressing is limited: [15] elaborates that the offset must be an immediate value of at most 32 bits. This leaves us with ±2 GB offsets. Making 64-bit offsets usable everywhere would be wasteful, since most code would never use the extra bits; yet these offsets are hardcoded directly into the instructions themselves, which makes the code occupy more space, and that is bad for the instruction caches. The address space is much larger than 32 bits can cover, so what do we do when 32-bit offsets are not enough?


A code model is a convention that both the programmer and the compiler adhere to; it describes the constraints imposed on the program that will use the object file currently being compiled. Code generation depends on it. In short, when the program is relatively small, it is fine to use 32-bit offsets. However, when it can grow large enough, the slower 64-bit offsets should be used, which are handled by multiple instructions. 32-bit offsets correspond to the small code model; 64-bit offsets correspond to the large code model. There is also a kind of compromise called the medium code model. All these models are treated differently in the context of position-dependent and position-independent code, so we will review six possible combinations. There are other code models, such as the kernel code model, but we leave them outside this book; if you create your own operating system, you can invent one for your own pleasure. The relevant GCC option is -mcmodel, for example, -mcmodel=large. The default model is the small one.7 The GCC manual says the following about the -mcmodel8 option:

-mcmodel=small  Generate code for the small code model: the program and its symbols must be linked in the lower 2 GB of the address space. Pointers are 64 bits. Programs can be statically or dynamically linked. This is the default code model.

-mcmodel=kernel  Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code.

-mcmodel=medium  Generate code for the medium model: the program is linked in the lower 2 GB of the address space. Small symbols are also placed there. Symbols with sizes larger than -mlarge-data-threshold are put into large data or BSS sections and can be located above 2 GB. Programs can be statically or dynamically linked.

-mcmodel=large  Generate code for the large model. This model makes no assumptions about the addresses and sizes of sections.
To illustrate the differences in the compiled code when using different code models, we will use a simple example shown in Listing 15-28.

Listing 15-28. cm-example.c

    char glob_small[100] = {1};
    char glob_big[10000000] = {1};
    static char loc_small[100] = {1};
    static char loc_big[10000000] = {1};

    int global_f(void) { return 42; }
    static int local_f(void) { return 42; }

    int main(void) {
        glob_small[0] = 42;
        glob_big[0] = 42;
        loc_small[0] = 42;

7. Not all compilers and GCC versions support the large model.
8. Note that the descriptions differ for different architectures.


        loc_big[0] = 42;
        global_f();
        local_f();
        return 0;
    }

We will use the following command line to compile it:

> gcc -O0 -g cm-example.c

The -g flag adds debug information, such as the .line section, which describes the correspondence between assembly instructions and lines of source code. In this example there are big and small arrays. The difference only matters for the medium code model, so we will omit accesses to the big arrays from the disassembly listings for the other models.

15.10.1 Small Code Model (No PIC)

In the small code model, the program size is limited: all objects must lie no farther than 2 GB from each other to be linked together. Linking can be either static or dynamic. Since this is the default code model, we will not see anything new here. By passing the -S switch to objdump, we can interleave the assembly code with the C source lines (if the corresponding file was compiled with the -g flag). The complete sequence of commands looks like this:

> gcc -O0 -g cm-example.c -o example
> objdump -D -Mintel-mnemonic -S example

Listing 15-29 shows the compiled assembly.

Listing 15-29. mc-small

    ;glob_small[0] = 42;
    4004f0: c6 05 49 0b 20 00 2a   mov BYTE PTR [rip+0x200b49],0x2a
    ;loc_small[0] = 42;
    4004fe: c6 05 3b a2 b8 00 2a   mov BYTE PTR [rip+0xb8a23b],0x2a
    ;global_f();
    40050c: e8 c5 ff ff ff         call 4004d6
    ;local_f();
    400511: e8 cb ff ff ff         call 4004e1

The second column shows the hexadecimal codes of the bytes that encode each instruction. Array accesses are performed explicitly relative to rip, and calls accept offsets (which are also implicitly rip-relative). We can see that the data access instructions are 7 bytes long, of which 1 byte is the value (0x2a) and 4 bytes encode the rip-relative offset. This illustrates the central idea of the small code model: rip-relative addressing.


15.10.2 Large Code Model (No PIC)

Now let's compile the same code using the large code model (-mcmodel=large).

    ;glob_small[0] = 42;
    4004f0: 48 b8 40 10 60 00 00   movabs rax,0x601040
    4004f7: 00 00 00
    4004fa: c6 00 2a               mov BYTE PTR [rax],0x2a

    ;loc_small[0] = 42;
    40050a: 48 b8 40 a7 f8 00 00   movabs rax,0xf8a740
    400511: 00 00 00
    400514: c6 00 2a               mov BYTE PTR [rax],0x2a

    ;global_f();
    400524: 48 b8 d6 04 40 00 00   movabs rax,0x4004d6
    40052b: 00 00 00
    40052e: ff d0                  call rax

    ;local_f();
    400530: 48 b8 e1 04 40 00 00   movabs rax,0x4004e1
    400537: 00 00 00
    40053a: ff d0                  call rax

Both data accesses and calls are performed uniformly: we always start by moving an immediate 64-bit value into one of the general purpose registers and then reference memory through the address stored in that register. This allows us to reference anything anywhere in the 64-bit virtual address space.

15.10.3 Medium Code Model (No PIC)

In the medium code model, arrays larger than the size given by the -mlarge-data-threshold compiler option are placed in the special sections .ldata and .lbss. These sections can be placed above the 2 GB mark. Essentially, this is the small code model, except for the big pieces of data, which are placed separately. Performance-wise, it is better than accessing everything through 64-bit pointers because of locality. The disassembly of the sources compiled with -mcmodel=medium follows:9

    glob_small[0] = 42;
    400530: c6 05 09 0b 20 00 2a   mov BYTE PTR [rip+0x200b09],0x2a

    glob_big[0] = 42;
    400537: 48 b8 40 11 a0 00 00   movabs rax,0xa01140
    40053e: 00 00 00
    400541: c6 00 2a               mov BYTE PTR [rax],0x2a

    loc_small[0] = 42;
    400544: c6 05 75 0b 20 00 2a   mov BYTE PTR [rip+0x200b75],0x2a

    loc_big[0] = 42;
    40054b: 48 b8 c0 a7 38 01 00   movabs rax,0x138a7c0
    400552: 00 00 00
    400555: c6 00 2a               mov BYTE PTR [rax],0x2a

    global_f();
    400558: e8 b9 ff ff ff         call 400516

    local_f();
    40055d: e8 bf ff ff ff         call 400521

As we can see, the generated code uses the large model to access the big arrays and the small model for all other accesses. It is pretty smart and might save you if you just need to work with a large amount of statically allocated data.

9. If you encounter the movabs instruction, consider it equivalent to mov.

15.10.4 Small PIC Code Model

Now let's investigate the position-independent counterparts of these three code models. As before, the small model holds no surprises, because until now we have only worked with the small code model. For convenience, we provide the example code compiled with gcc -g -O0 -mcmodel=small -fpic.

    glob_small[0] = 42;
    4004f0: 48 8d 05 49 0b 20 00   lea rax,[rip+0x200b49]            # 601040
    4004f7: c6 00 2a               mov BYTE PTR [rax],0x2a

    glob_big[0] = 42;
    4004fa: 48 8d 05 bf 0b 20 00   lea rax,[rip+0x200bbf]            # 6010c0
    400501: c6 00 2a               mov BYTE PTR [rax],0x2a

    loc_small[0] = 42;
    400504: c6 05 35 a2 b8 00 2a   mov BYTE PTR [rip+0xb8a235],0x2a  # f8a740

    loc_big[0] = 42;
    40050b: c6 05 ae a2 b8 00 2a   mov BYTE PTR [rip+0xb8a2ae],0x2a  # f8a7c0

    global_f();
    400512: e8 bf ff ff ff         call 4004d6

    local_f();
    400517: e8 c5 ff ff ff         call 4004e1

Static arrays are simply accessed relative to rip, as expected. Globally visible arrays are accessed via the GOT, which implies an additional read to get the address from the table itself.


15.10.5 Large PIC Code Model

Interesting things start to happen when we use the large code model with position-independent code. Now we cannot use rip-relative addressing to reach the GOT, because it might be farther than 2 GB away in the address space! Because of that, we need to dedicate a register to store its address (rbx in our case).

    # Standard prologue
    400594: 55                     push rbp
    400595: 48 89 e5               mov rbp,rsp

    # What is this?
    400598: 41 57                  push r15
    40059a: 53                     push rbx
    40059b: 48 8d 1d f9 ff ff ff   lea rbx,[rip+0xfffffffffffffff9]  # 40059b
    4005a2: 49 bb 65 0a 20 00 00   movabs r11,0x200a65
    4005a9: 00 00 00
    4005ac: 4c 01 db               add rbx,r11

    # Accessing global symbols
    glob_small[0] = 42;
    4005af: 48 b8 e8 ff ff ff ff   movabs rax,0xffffffffffffffe8
    4005b6: ff ff ff
    4005b9: 48 8b 04 03            mov rax,QWORD PTR [rbx+rax*1]
    4005bd: c6 00 2a               mov BYTE PTR [rax],0x2a

    # Accessing local symbols
    loc_small[0] = 42;
    4005d1: 48 b8 40 97 98 00 00   movabs rax,0x989740
    4005d8: 00 00 00
    4005db: c6 04 03 2a            mov BYTE PTR [rbx+rax*1],0x2a

    # Calling the global function
    global_f();
    4005ed: 49 89 df               mov r15,rbx
    4005f0: 48 b8 56 f5 df ff ff   movabs rax,0xffffffffffdff556
    4005f7: ff ff ff
    4005fa: 48 01 d8               add rax,rbx
    4005fd: ff d0                  call rax

    # Calling the local function
    local_f();
    4005ff: 48 b8 75 f5 df ff ff   movabs rax,0xffffffffffdff575
    400606: ff ff ff
    400609: 48 8d 04 03            lea rax,[rbx+rax*1]
    40060d: ff d0                  call rax

    return 0;
    40060f: b8 00 00 00 00         mov eax,0x0
    }


    400614: 5b                     pop rbx
    400615: 41 5f                  pop r15
    400617: 5d                     pop rbp
    400618: c3                     ret

This example should be studied carefully. First, we want to break down the unusual code in the function prologue.

    400598: 41 57                  push r15
    40059a: 53                     push rbx
    40059b: 48 8d 1d f9 ff ff ff   lea rbx,[rip+0xfffffffffffffff9]  # 40059b
    4005a2: 49 bb 65 0a 20 00 00   movabs r11,0x200a65
    4005a9: 00 00 00
    4005ac: 4c 01 db               add rbx,r11

These instructions construct the GOT address from the following two components:

• The current instruction address, computed by lea rbx,[rip+0xfffffffffffffff9]. The operand equals -7, while the instruction itself is 7 bytes long. During execution, the rip value points at the next instruction, so after this lea, rbx holds the address of the lea instruction itself.

• Then the number 0x200a65 is added to rbx. This goes through another register, because the add instruction does not support 64-bit immediate operands (check the instruction description in [15]!). This number is the offset of the GOT relative to the address of the lea instruction, which, as we know, is always known at link time in position-independent code.10

The ABI states that r15 should contain the GOT address at all times. GCC additionally uses rbx here for its own convenience. The absolute GOT address is unknown at link time, since the code is written to be position-independent. Now, the data accesses: a global symbol is accessed through the GOT in the same manner as in non-PIC code; however, since the GOT address is stored in rbx, we have to compute the entry address with a few more instructions.

    # Accessing global symbols
    glob_small[0] = 42;
    4005af: 48 b8 e8 ff ff ff ff   movabs rax,0xffffffffffffffe8
    4005b6: ff ff ff
    4005b9: 48 8b 04 03            mov rax,QWORD PTR [rbx+rax*1]
    4005bd: c6 00 2a               mov BYTE PTR [rax],0x2a

The entry lies at offset -24 relative to the rbx value. This offset can have arbitrary length, so we have to store it in a register to handle cases when it does not fit into 32 bits. We then load the GOT entry into rax and use that address for our purposes (in this case, we store a value into the first array element).

10. Strictly speaking, r15 and rbx here contain not the beginning of the GOT but its end; that does not matter, though.


Variables that are not visible to other objects are also accessed using the GOT-based address. However, we do not read their addresses from the GOT. Instead, we use the rbx value as a base (since it points somewhere into the data segment). Each such variable has a fixed offset from this base, so we can pick this offset and use base-indexed addressing.

    # Accessing local symbols
    loc_small[0] = 42;
    4005d1: 48 b8 40 97 98 00 00   movabs rax,0x989740
    4005d8: 00 00 00
    4005db: c6 04 03 2a            mov BYTE PTR [rbx+rax*1],0x2a

This is obviously faster, which is why, whenever you can, you should prefer to limit symbol visibility, as explained in section 15.9. Local functions are called in a similar way: their address is computed relative to the GOT and stored in a register. We cannot use a plain call command, because its immediate operand is limited to 32 bits (in the description given in [15] there are only rel16 and rel32 operand types, but no rel64).

    # Calling the local function
    local_f();
    4005ff: 48 b8 75 f5 df ff ff   movabs rax,0xffffffffffdff575
    400606: ff ff ff
    400609: 48 8d 04 03            lea rax,[rbx+rax*1]
    40060d: ff d0                  call rax

Calling global functions is done in a more traditional way: their PLT entry is used, whose address is also computed as a fixed offset from the known GOT position.

    # Calling the global function
    global_f();
    4005ed: 49 89 df               mov r15,rbx
    4005f0: 48 b8 56 f5 df ff ff   movabs rax,0xffffffffffdff556
    4005f7: ff ff ff
    4005fa: 48 01 d8               add rax,rbx
    4005fd: ff d0                  call rax

15.10.6 Medium PIC Code Model

The medium code model, as in the non-PIC case, is a mixture of the large and small code models. We can think of it as the small PIC code model with the addition of big arrays residing separately.

    int main(void) {
    40057a: 55                     push rbp
    40057b: 48 89 e5               mov rbp,rsp
    # Unlike the small model, we store the GOT address in a register.
    40057e: 48 8d 15 7b 0a 20 00   lea rdx,[rip+0x200a7b]

    glob_small[0] = 42;
    400585: 48 8d 05 b4 0a 20 00   lea rax,[rip+0x200ab4]
    40058c: c6 00 2a               mov BYTE PTR [rax],0x2a

    glob_big[0] = 42;
    40058f: 48 8b 05 62 0a 20 00   mov rax,QWORD PTR [rip+0x200a62]
    400596: c6 00 2a               mov BYTE PTR [rax],0x2a

    loc_small[0] = 42;
    400599: c6 05 20 0b 20 00 2a   mov BYTE PTR [rip+0x200b20],0x2a

    loc_big[0] = 42;
    4005a0: 48 b8 c0 97 d8 00 00   movabs rax,0xd897c0
    4005a7: 00 00 00
    4005aa: c6 04 02 2a            mov BYTE PTR [rdx+rax*1],0x2a

    global_f();
    4005ae: e8 a3 ff ff ff         call 400556

    local_f();
    4005b3: e8 b0 ff ff ff         call 400568

    return 0;
    4005b8: b8 00 00 00 00         mov eax,0x0
    }
    4005bd: 5d                     pop rbp
    4005be: c3                     ret

The GOT address lies within reach of rip-relative addressing, so it is loaded with a single instruction.

    40057e: 48 8d 15 7b 0a 20 00   lea rdx,[rip+0x200a7b]

Hence it is not strictly necessary to dedicate a register to it, since this address is not needed everywhere. Code addresses are assumed to fit into the range of 32-bit rip-relative offsets, so calling any function is trivial.

    global_f();
    4005ae: e8 a3 ff ff ff         call 400556

    local_f();
    4005b3: e8 b0 ff ff ff         call 400568

As for data, accesses to global variables are performed uniformly regardless of their size. The GOT is involved in both cases and contains full 64-bit addresses of the global variables, so we get the ability to address anything for free.

    glob_small[0] = 42;
    400585: 48 8d 05 b4 0a 20 00   lea rax,[rip+0x200ab4]
    40058c: c6 00 2a               mov BYTE PTR [rax],0x2a

    glob_big[0] = 42;
    40058f: 48 8b 05 62 0a 20 00   mov rax,QWORD PTR [rip+0x200a62]
    400596: c6 00 2a               mov BYTE PTR [rax],0x2a


Local variables, however, differ. Small ones can be accessed relative to rip.

    loc_small[0] = 42;
    400599: c6 05 20 0b 20 00 2a   mov BYTE PTR [rip+0x200b20],0x2a

Big local arrays are addressed relative to the GOT start, as in the large model.

    loc_big[0] = 42;
    4005a0: 48 b8 c0 97 d8 00 00   movabs rax,0xd897c0
    4005a7: 00 00 00
    4005aa: c6 04 02 2a            mov BYTE PTR [rdx+rax*1],0x2a

15.11 Summary

In this chapter we have acquired the knowledge needed to understand the mechanism behind loading and using dynamic libraries. We wrote a library in both assembly language and C and successfully linked it against an executable. For further reading we refer you, above all, to the classic article [13] and to the ABI description [24]. In the next chapter we will talk about compiler optimizations and their effect on performance, as well as the specialized instruction set extensions (SSE/AVX) designed to speed up certain types of computations.

■■Question 297  What is the difference between static and dynamic linking?
■■Question 298  What does the dynamic linker do?
■■Question 299  Can we resolve all dependencies at link time? What kind of system would make this possible?
■■Question 300  Should we always relocate the .data section?
■■Question 301  Should we always relocate the .text section?
■■Question 302  What is PIC?
■■Question 303  Can we share a .text section between processes when it is being relocated?
■■Question 304  Can we share a .data section between processes when it is being relocated?
■■Question 305  Can we share a .data section during relocation?
■■Question 306  Why do we compile dynamic libraries with the -fPIC flag?
■■Question 307  Write a simple dynamic library in C from scratch and demonstrate calling a function from it.
■■Question 308  What is ldd for?
■■Question 309  Where are libraries searched for?
■■Question 310  What is the LD_LIBRARY_PATH environment variable for?


■■Question 311  What is the GOT? Why is it necessary?
■■Question 312  What makes the usage of the GOT efficient?
■■Question 313  Why can position-independent code address the GOT directly but cannot address global variables directly?
■■Question 314  Is the GOT unique for each process?
■■Question 315  What is the PLT?
■■Question 316  Why don't we use the GOT to call functions from other objects (or do we)?
■■Question 317  Where does the initial GOT entry for a function point?
■■Question 318  How do we preload a library, and what can it be used for?
■■Question 319  In assembly, how do we address a symbol defined in the executable and accessed from the same executable?
■■Question 320  In assembly, how do we address a symbol defined in the library and accessed from the same library?
■■Question 321  In assembly, how do we address a symbol defined in the executable and accessible from everywhere?
■■Question 322  In assembly, how do we address a symbol defined in the library and accessible from everywhere?
■■Question 323  How do we control symbol visibility in a dynamic library? How can we make a symbol private to the library but still usable from all of its files?
■■Question 324  Why do people sometimes write wrapper functions for those used in a library?
■■Question 325  How do we link against a library stored in the directory libdir?
■■Question 326  What is a code model, and why do we care about code models?
■■Question 327  What are the limitations of the small code model?
■■Question 328  What overhead does the large code model impose?
■■Question 329  What is the trade-off between the large and small code models?
■■Question 330  When is the medium model most useful?
■■Question 331  How do the large code models for PIC and non-PIC code differ?
■■Question 332  How do the medium code models for PIC and non-PIC code differ?


CHAPTER 16

Performance

In this chapter we will study ways to write faster code. To do that, we will look at SSE (Streaming SIMD Extensions) instructions, study compiler optimizations, and examine the operation of hardware caches. Note that this chapter is only an introduction to the topic and will not make you an optimization expert. There is no magic technique that makes everything fast. Hardware has become so complex that even an educated guess about which code slows program execution down can easily fail. Testing and profiling should always be performed, and performance should be measured reproducibly. This means that everything relevant about the environment must be described in such detail that anyone can replicate the experimental conditions and obtain similar results.

16.1 Optimizations

In this section we want to discuss the most important optimizations that happen during the translation process. They are crucial for understanding how to write quality code. Why? A common type of decision in programming is a trade-off between code readability and performance. Knowing the optimizations is necessary for making good decisions. Otherwise, when choosing between two versions of code, we might pick the less readable one because it "seems" to perform fewer actions. In reality, however, both versions may be optimized into exactly the same sequence of assembly instructions. In that case we have merely produced less readable code without any benefit.

■■Note  In the listings shown in this section we will often use the GCC directive __attribute__ ((noinline)). Applied to a function definition, it suppresses inlining of that function. The example functions are often small, which encourages compilers to inline them; we do not want that here, in order to better display the various optimization effects. Alternatively, we could have compiled the examples with the -fno-inline option.

16.1.1 Myth About Fast Languages

There is a common misconception that the language defines the speed of program execution. This is not true. The best and most useful performance tests are often highly specialized: they measure performance in very specific cases. This prevents us from making bold generalizations. Therefore, when making performance claims, it is advisable to provide as detailed a description of the test scenario and the test results as possible. The description should be sufficient to build a similar system and run similar tests with comparable results.

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_16


Chapter 16 ■ Performance

There are cases when a program written in C can be outrun by another program performing similar actions but written in, for example, Java. This has nothing to do with the language itself. For example, a typical malloc implementation has a particular property: its running time is hard to predict. In general, it depends on the current heap state: how many blocks there are, how fragmented the heap is, and so on. In almost every case, though, it is greater than the cost of stack allocation. In a typical Java virtual machine implementation, however, memory allocation is fast. This is because the Java heap structure is simpler. With some simplification, it is just a region of memory plus a pointer inside it, delimiting the occupied area from the free one. Allocating memory means moving this pointer further into the free part, which is fast. This comes at a cost, however: to get rid of chunks of memory we no longer need, garbage collection is performed, which can halt the program for an unpredictable amount of time. Imagine a situation where garbage collection never occurs: say, a program allocates memory, performs computations, and exits, destroying its whole address space without ever invoking the garbage collector. In this case it is possible for the Java program to run faster, due to the careful allocation overhead imposed by malloc. If we used a custom memory allocator tailored to the needs of this specific task, though, we could pull the same trick in C, drastically changing the result. Additionally, since Java is usually compiled just-in-time at runtime, the virtual machine has access to runtime optimizations based on how exactly the program executes. For example, methods that usually execute one after another can be placed close to each other in memory so that they are cached together. Doing so requires collecting information about the program execution trace, which is only possible at runtime.
What really sets C apart from many other languages is its very transparent cost model. No matter what you write, it is easy to imagine which assembly instructions will be emitted. On the other hand, languages that are primarily intended to work inside a runtime (Java, C#), or that provide a number of additional abstractions, such as C++ with its virtual inheritance mechanism, are harder to predict. The only two real abstractions C provides are structures/unions and functions. Naively translated into machine instructions, a C program runs very slowly; it does not correspond to the code generated by a good optimizing compiler. Usually a programmer has no deeper knowledge of low-level architecture details than the compiler, and such knowledge is essential for low-level optimization, so he will rarely be able to compete with the compiler. Sometimes, however, for a specific platform and compiler, a program can be changed, usually at the expense of readability and maintainability, in a way that speeds up the code. Once again, performance tests are a must in every such case.

16.1.2 General Tips

When programming, you generally should not worry about optimizations right away. Premature optimization is bad for several reasons.

• Most programs are written so that only a small fraction of their code is executed repeatedly. That code is what determines how fast the program runs; speeding up the other parts will have little or no effect.

The best way to find these pieces of code is to use a profiler, a utility program that measures how often and for how long different pieces of code run.

• Manually optimized code is almost always less readable and harder to maintain.
• Modern compilers are aware of common patterns in high-level language code. These patterns are optimized well because compiler writers have put a lot of effort into them, and that effort pays off.

328

Chapter 16 ■ Performance

The most important optimization is often choosing the right algorithm; low-level optimizations at the assembly level are rarely as beneficial. For example, accessing a linked list element by index is slow, because we have to traverse the list from the beginning, jumping from node to node. Arrays are much better when the program logic requires accessing elements by index. On the other hand, inserting into a linked list is cheap compared to an array, because to insert an element at position i of an array we first have to move all subsequent elements (or maybe even reallocate memory and copy everything). Simple, clean code is often also the most efficient. If the performance is still not satisfactory, we need to find the most frequently executed code using the profiler and try to optimize it by hand. Check for duplicated computations and try to memoize and reuse their results. Study the assembly listings and check whether forcing inlining for some of the functions used helps. General hardware concerns such as locality and cache usage should be considered at this point; we will talk about them in section 16.2. Compiler optimizations must also be considered; we will cover the basics later in this section. Enabling or disabling specific optimizations for a particular file or region of code can have a positive impact on performance. GCC groups optimizations into levels selected by the flags -O0, -O1, -O2, -O3, and -Os (which optimizes for size, to produce the smallest possible executable file). The number after -O grows as the set of enabled optimizations grows; with -O3, they are usually all turned on. Specific optimizations can also be switched on and off individually.
Each optimization type has two compiler options associated with it, for example, -fforward-propagate and -fno-forward-propagate.
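The memoization advice given above, storing and reusing the results of duplicated computations, can be sketched in a few lines of C. The fib example is illustrative and not from the book.

```c
#include <assert.h>
#include <stdint.h>

/* Memoization sketch: cache each result the first time it is
   computed and reuse it afterward. Without the cache, the naive
   recursion recomputes the same values exponentially many times. */
#define MEMO_N 64
static uint64_t memo[MEMO_N]; /* 0 means "not computed yet"; fib(n) > 0 for n >= 1 */

static uint64_t fib(int n) {
    if (n < 2) return (uint64_t)n;
    if (memo[n]) return memo[n];            /* reuse a stored result */
    return memo[n] = fib(n - 1) + fib(n - 2); /* compute once, remember */
}
```

With the cache, fib(50) takes a handful of additions instead of billions of recursive calls.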

16.1.3 Omit Stack Frame Pointer

Related GCC option: -fomit-frame-pointer

Sometimes we do not need to store the old rbp value and initialize rbp with the new frame base. This happens when

• There are no local variables; or
• The local variables fit into the red zone AND the function does not call any other functions.

There is a downside, however: less information about the program state is kept at runtime. We will have trouble unwinding the call stack and getting local variable values, because we no longer know where each frame starts. This is most problematic when a program crashes and a dump of its state must be analyzed; such builds are usually heavily optimized and lack debugging information. Performance-wise, the effect of this optimization is usually negligible [26]. The code shown in Listing 16-1 unwinds the stack and displays the frame pointer addresses of all functions invoked and still active when unwind is called. Compile it with -O0 to avoid optimizations.

Listing 16-1. stack_unwind.c
void unwind();

void f(int count) {
    if (count) f(count-1);
    else unwind();
}

int main(void) { f(10); return 0; }


Listing 16-2 shows the assembly part.

Listing 16-2. stack_unwind.asm
extern printf
global unwind

section .rodata
format: db "%x ", 10, 0

section .text
unwind:
    push rbx
    ; while (rbx != 0) { print rbx; rbx = [rbx]; }
    mov rbx, rbp
.loop:
    test rbx, rbx
    jz .end
    mov rdi, format
    mov rsi, rbx
    call printf
    mov rbx, [rbx]
    jmp .loop
.end:
    pop rbx
    ret

How do we use it? Try -fomit-frame-pointer as a last resort to improve the performance of code that involves a large number of small function calls.

16.1.4 Tail Recursion

Related GCC options: -fomit-frame-pointer -foptimize-sibling-calls

Let's look at the function shown in Listing 16-3.

Listing 16-3. factorial_tailrec.c
__attribute__ (( noinline ))
int factorial( int acc, int arg ) {
    if ( arg == 0 ) return acc;
    return factorial( acc * arg, arg-1 );
}

int main(int argc, char** argv) { return factorial(1, argc); }

It calls itself recursively, but this call is a tail call: once it completes, the function returns immediately.


We say that a function is tail recursive if it either

• Returns a value that does not involve a recursive call, for example, return 4;, or
• Calls itself recursively with different arguments and returns the result immediately, without performing any further computations on it, for example, return factorial( acc * arg, arg-1 );.

A function is not tail recursive when the result of the recursive call is used in further computations. Listing 16-4 shows a non-tail recursive factorial: the result of the recursive call is multiplied by arg before being returned, so there is no tail recursion.

Listing 16-4. factorial_nontailrec.c
__attribute__ (( noinline ))
int factorial( int arg ) {
    if ( arg == 0 ) return 1;
    return arg * factorial( arg-1 );
}

int main(int argc, char** argv) { return factorial(argc); }

In Chapter 2 we studied Question 20, which proposes a solution in the spirit of tail recursion. When the last thing a function does is call another function, immediately followed by a return, we can jump to the beginning of that function instead. In other words, the following instruction pattern may be subject to optimization:

; elsewhere:
call f
...

f:
    ...
    call g
    ret      ; 1

g:
    ...
    ret      ; 2

The ret instructions in this listing are marked first and second. Executing call g pushes a return address onto the stack: the address of the first ret instruction. When g completes its execution, it executes the second ret, which pops that return address, leaving us at the first ret. So, two ret instructions in a row execute before control is passed back to the function that called f. Why not return to f's caller immediately? To do that, we replace call g with jmp g. Now, from g, we will never come back into f, nor do we put a useless return address on the stack. The second ret will pop the return address pushed by call f, which happened somewhere earlier, and return us there directly.


; elsewhere:
call f
...

f:
    ...
    jmp g

g:
    ...
    ret      ; 2

When g and f are the same function, this is exactly the case of tail recursion. Unoptimized, factorial(5, 1) will launch five calls, polluting the stack with five stack frames; the last call will end up executing ret five times in a row to get rid of all the return addresses. Modern compilers are usually aware of tail calls and know how to turn tail recursion into a loop. The assembly produced by GCC for the tail recursive factorial (Listing 16-3) is shown in Listing 16-5.

Listing 16-5. factorial_tailrec.asm
00000000004004c6 <factorial>:
  4004c6:  mov  eax, edi
  4004c8:  test esi, esi
  4004ca:  je   4004d3
  4004cc:  imul eax, esi
  4004cf:  dec  esi
  4004d1:  jmp  4004c8
  4004d3:  ret

No new stack frames are entered for the new arguments: the recursion has become a loop. Loops are faster than recursion because recursion needs additional stack space (which can also cause a stack overflow). So why not always use loops? Recursion often lets us express algorithms more concisely and elegantly. If we can write a function so that it becomes tail recursive, the recursion will not affect performance. Listing 16-6 shows an example function that accesses a linked list element by index.

Listing 16-6. tail_rec_example_list.c
#include <stdio.h>
#include <stdlib.h>

struct llist {
    struct llist* next;
    int value;
};


struct llist* llist_at( struct llist* lst, size_t idx ) {
    if ( lst && idx ) return llist_at( lst->next, idx-1 );
    return lst;
}

struct llist* c( int value, struct llist* next ) {
    struct llist* lst = malloc( sizeof( struct llist ) );
    lst->next = next;
    lst->value = value;
    return lst;
}

int main( void ) {
    struct llist* lst = c( 1, c( 2, c( 3, NULL )));
    printf("%d\n", llist_at( lst, 2 )->value );
    return 0;
}

Compiling with -Os produces the non-recursive code shown in Listing 16-7.

Listing 16-7. tail_rec_example_list.asm
0000000000400596 <llist_at>:
  400596: 48 89 f8   mov  rax, rdi
  400599: 48 85 f6   test rsi, rsi
  40059c: 74 0d      je   4005ab
  40059e: 48 85 c0   test rax, rax
  4005a1: 74 08      je   4005ab
  4005a3: 48 ff ce   dec  rsi
  4005a6: 48 8b 00   mov  rax, QWORD PTR [rax]
  4005a9: eb ee      jmp  400599
  4005ab: c3         ret

How do we use it? Never be afraid to use tail recursion when it makes your code more readable, since it carries no performance penalty.
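The transformation the compiler performs in Listings 16-5 and 16-7 can also be written by hand in C: the tail recursive factorial of Listing 16-3 becomes a loop in which the accumulator and the argument are simply updated in place, just as the registers are in the generated assembly.

```c
#include <assert.h>

/* The tail recursive factorial of Listing 16-3, rewritten by hand
   as the loop GCC effectively emits: each "recursive call" becomes
   an update of the two arguments followed by a jump to the top. */
static int factorial_loop(int acc, int arg) {
    for (;;) {
        if (arg == 0) return acc;
        acc *= arg;   /* acc * arg -> new first argument  */
        arg -= 1;     /* arg - 1   -> new second argument */
    }
}
```

This is exactly why tail recursion carries no performance penalty: the compiled form of both versions is the same loop.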

16.1.5 Common Subexpression Elimination

Related GCC options: -fgcse and others containing the substring cse.

When two expressions share a common part, that part is not evaluated twice. This means that, from a performance point of view, there is no need to compute the common part in advance, store its result in a variable, and use that variable in both expressions. In the example shown in Listing 16-8, the subexpression x*x + 2*x is evaluated only once, even though a naive reading suggests otherwise.


Listing 16-8. common_subexpression.c
#include <stdio.h>

__attribute__ ((noinline))
void test(int x) {
    printf("%d %d", x*x + 2*x + 1, x*x + 2*x - 1);
}

int main(int argc, char** argv) {
    test(argc);
    return 0;
}

Listing 16-9 shows the compiled code, which indeed does not compute x*x + 2*x twice.

Listing 16-9. common_subexpression.asm
0000000000400516 <test>:
  ; rsi = x + 2
  400516: 8d 77 02        lea  esi, [rdi+0x2]
  400519: 31 c0           xor  eax, eax
  ; rsi = x * (x+2)
  40051b: 0f af f7        imul esi, edi
  40051e: bf b4 05 40 00  mov  edi, 0x4005b4
  ; rdx = rsi - 1 = x*(x+2) - 1
  400523: 8d 56 ff        lea  edx, [rsi-0x1]
  ; rsi = rsi + 1 = x*(x+2) + 1
  400526: ff c6           inc  esi
  400528: e9 b3 fe ff ff  jmp  4003e0

How do we use it? Do not be afraid to write readable formulas containing repeated common subexpressions: they will be computed efficiently, and the code stays readable.

16.1.6 Constant Propagation

Related GCC options: -fipa-cp, -fgcse, -fipa-cp-clone, etc.

If the compiler can prove that some variable has a specific value at a certain place in the program, it can skip reading the variable and substitute the value directly. Sometimes it even generates specialized versions of functions, partially applied to some arguments, when it knows the exact value of an argument (option -fipa-cp-clone). For example, Listing 16-10 shows a typical case where a specialized version of sum will be created, which takes only one argument instead of two, the other being fixed to 42.

Listing 16-10. constant_propagation.c
__attribute__ ((noinline))
static int sum(int x, int y) { return x + y; }

int main( int argc, char** argv ) {
    return sum( 42, argc );
}


Listing 16-11 shows the resulting assembly.

Listing 16-11. constant_propagation.asm
00000000004004c0 <main>:
  4004c0: 8d 47 2a  lea eax, [rdi+0x2a]
  4004c3: c3        ret

It gets better: the compiler can compute whole expressions for you (including function calls) when their inputs are known constants. Listing 16-12 shows an example.

Listing 16-12. cp_fact.c
#include <stdio.h>

int fact( int n ) {
    if (n == 0) return 1;
    else return n * fact(n-1);
}

int main(void) {
    printf("%d\n", fact( 4 ) );
    return 0;
}

Of course, this factorial always evaluates to the same result, because its value does not depend on user input. GCC is smart enough to precompute it, eliminating the call and substituting 24 for fact(4) directly, as shown in Listing 16-13. The instruction mov edx, 0x18 puts 24 (0x18 in hexadecimal) directly into rdx.

Listing 16-13. cp_fact.asm
0000000000400450 <main>:
  400450: 48 83 ec 08     sub  rsp, 0x8
  400454: ba 18 00 00 00  mov  edx, 0x18
  400459: be 44 07 40 00  mov  esi, 0x400744
  40045e: bf 01 00 00 00  mov  edi, 0x1
          31 c0           xor  eax, eax
          ...             ; call to printf
  40046c: 48 83 c4 08     add  rsp, 0x8
  400470: c3              ret

How do we use it? Named constants are not harmful, and neither are constant variables: the compiler can and will precompute anything it can, including calls to side-effect-free functions with known arguments. On the other hand, multiple specialized copies of a function, one per distinct argument value, can be bad for locality and make the executable grow in size. Keep this in mind if you run into performance problems.


16.1.7 (Named) Return Value Optimization

Copy elision and return value optimization allow us to eliminate unnecessary copy operations. Remember that, naively speaking, local variables are created in the function's stack frame. So, if a function returns an instance of a structure type, it should first create it in its own stack frame and then copy it to the outside world (unless it fits into the two general purpose registers rax and rdx). Listing 16-14 shows an example.

Listing 16-14. nrvo.c
struct p {
    long x;
    long y;
    long z;
};

__attribute__ ((noinline))
struct p f(void) {
    struct p copy;
    copy.x = 1;
    copy.y = 2;
    copy.z = 3;
    return copy;
}

int main(int argc, char** argv) {
    volatile struct p inst = f();
    return 0;
}

An instance of struct p named copy is created in the stack frame of f. Its fields are filled with the values 1, 2, and 3 and then copied to the outside world through the pointer f accepts as a hidden argument. Listing 16-15 shows the assembly produced without optimizations.

Listing 16-15. nrvo_off.asm
00000000004004b6 <f>:
  ; prologue
  4004b6: 55                       push rbp
  4004b7: 48 89 e5                 mov  rbp, rsp
  ; The hidden argument is the address of the structure that will
  ; hold the result. It is saved on the stack.
  4004ba: 48 89 7d d8              mov  QWORD PTR [rbp-0x28], rdi
  ; Fill the fields of the local variable `copy`
  4004be: 48 c7 45 e0 01 00 00 00  mov  QWORD PTR [rbp-0x20], 0x1
  4004c6: 48 c7 45 e8 02 00 00 00  mov  QWORD PTR [rbp-0x18], 0x2
  4004ce: 48 c7 45 f0 03 00 00 00  mov  QWORD PTR [rbp-0x10], 0x3
  ; rax = address of the target structure


  4004d6: 48 8b 45 d8  mov  rax, QWORD PTR [rbp-0x28]
  ; [rax] = 1 (taken from `copy.x`)
  4004da: 48 8b 55 e0  mov  rdx, QWORD PTR [rbp-0x20]
  4004de: 48 89 10     mov  QWORD PTR [rax], rdx
  ; [rax+0x8] = 2 (taken from `copy.y`)
  4004e1: 48 8b 55 e8  mov  rdx, QWORD PTR [rbp-0x18]
  4004e5: 48 89 50 08  mov  QWORD PTR [rax+0x8], rdx
  ; [rax+0x10] = 3 (taken from `copy.z`)
  4004e9: 48 8b 55 f0  mov  rdx, QWORD PTR [rbp-0x10]
  4004ed: 48 89 50 10  mov  QWORD PTR [rax+0x10], rdx
  ; rax = address where we put the structure contents
  ; (it was the hidden argument)
  4004f1: 48 8b 45 d8  mov  rax, QWORD PTR [rbp-0x28]
  4004f5: 5d           pop  rbp
  4004f6: c3           ret

00000000004004f7 <main>:
  4004f7: 55           push rbp
  ...
  40050a: 48 89 c7     mov  rdi, rax
  40050d: e8 a4 ff ff ff  call f
  ...
                       ret

With optimizations enabled, the compiler produces the more efficient code shown in Listing 16-16.

Listing 16-16. nrvo_on.asm
00000000004004b6 <f>:
  4004b6: 48 89 f8              mov  rax, rdi
  4004b9: 48 c7 07 01 00 00 00  mov  QWORD PTR [rdi], 0x1
  ; (the stores of 0x2 and 0x3 to [rdi+0x8] and [rdi+0x10] follow)
  ...
                                ret

<main>:
          48 83 ec 20     sub  rsp, 0x20
  4004d5: 48 89 e7        mov  rdi, rsp
  4004d8: e8 d9 ff ff ff  call f
          48 83 c4 20     add  rsp, 0x20
  4004e6: c3              ret


We do not allocate a place in the stack frame for copy at all! Instead, we operate directly on the structure passed to us through the hidden argument. How do we use it? If you want to write a function that fills in a structure, feel free to return it by value: it is usually no worse than passing a pointer to a preallocated region of memory (and allocating that region with malloc is slower still).
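The advice above can be sketched as follows. The type and function names here are hypothetical, not from the book; the point is only that returning the structure by value lets the compiler build it directly in the caller's storage.

```c
#include <assert.h>

/* Returning a structure by value: with (N)RVO the compiler elides
   the copy and fills the caller-provided storage directly, as
   Listing 16-16 shows for struct p. */
struct point3 { long x, y, z; };

static struct point3 make_point(long x, long y, long z) {
    struct point3 p;             /* with NRVO, built directly in the  */
    p.x = x; p.y = y; p.z = z;   /* storage supplied by the caller    */
    return p;                    /* no copy needs to be emitted       */
}
```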

16.1.8 Influence of Branch Prediction

At the microcode level, the actions performed by the CPU (central processing unit) are even more primitive than machine instructions; they are also reordered to make better use of all CPU resources. Branch prediction is a hardware mechanism aimed at improving program execution speed. When the CPU sees a conditional branch instruction (such as jg), it can

• Start executing both branches simultaneously; or
• Guess which branch will be taken and start executing it.

This happens when the result of the computation the jump depends on (for example, the flag values jg checks) is not ready yet, so we start executing code speculatively in order not to waste time. The branch prediction unit can fail, giving an erroneous prediction. In that case, once the computation is complete, the CPU has to do additional work to revert the changes made by the speculatively executed instructions. This is slow and has a real impact on program performance, but mispredictions are relatively rare. The exact branch prediction logic depends on the CPU model. In general, there are two kinds of prediction [6]: static and dynamic.

• If the CPU has no information about a jump (when it is executed for the first time), a static algorithm is used. A possible simple algorithm is as follows:
–– If it is a forward jump, we assume it is not taken.
–– If it is a backward jump, we assume it is taken.

This makes sense because backward jumps typically implement loops, and a loop jump is taken on every iteration except the last.

• If the jump has occurred before, the CPU can use more sophisticated algorithms. For example, it can keep a ring buffer storing information about whether recent executions of the jump were taken or not; in other words, it stores the jump history. With this approach, small loops whose iteration count divides the buffer size are predicted well. The best source of information about an exact CPU model is its optimization manual [16]. Unfortunately, most information about CPU internals is not disclosed to the public. How do we use it? When writing if-then-else or switch statements, put the most probable cases first. You can also give explicit hints, such as GCC's __builtin_expect, which on some CPUs translates into special instruction prefixes on jump instructions (see [6]).
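A sketch of the __builtin_expect hint mentioned above, using the common likely/unlikely macro convention; the checked_div function is a hypothetical example, not from the book.

```c
#include <assert.h>

/* Branch hints for GCC: tell the compiler which way a condition
   usually goes, so the common path is laid out fall-through. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int checked_div(int a, int b) {
    if (unlikely(b == 0))   /* rare error path, moved off the hot path */
        return 0;
    return a / b;           /* common case */
}
```

The hint changes only code layout and prediction, never the result; as with all optimizations in this chapter, its benefit must be confirmed by measurement.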

16.1.9 Influence of Execution Units

A CPU consists of many parts. Each instruction is executed in several stages, and each stage is handled by a different part of the CPU. For example, the first stage is usually called instruction fetch: the instruction is loaded from memory1 without any regard for its semantics.

1

We omit the discussion of instruction caching for the sake of brevity.


The part of the CPU that performs computations is called an execution unit. Execution units implement the different kinds of operations the CPU has to handle: instruction fetching, arithmetic, address translation, instruction decoding, etc., and the CPU can use them more or less independently. Different instructions are executed in different numbers of stages, and each stage can be handled by a different execution unit. This allows interesting usage scenarios such as the following:

• Fetching an instruction immediately after another one has been fetched (but has not yet completed its execution).
• Performing several arithmetic operations simultaneously, even though they are described sequentially in the assembly code. CPUs of the Pentium 4 family were already able to execute four arithmetic instructions simultaneously under the right circumstances.

How do we use the knowledge that execution units exist? Let's look at the example shown in Listing 16-17.

Listing 16-17. cycle_nonpar_arith.asm
looper:
    mov rax, [rsi]
    ; The next instruction depends on the previous one.
    ; This means that we cannot swap them, because
    ; the program behavior would change.
    xor rax, 0x1
    ; One more dependency here
    add [rdi], rax
    add rsi, 8
    add rdi, 8
    dec rcx
    jnz looper

Can we make it faster? We see dependencies between instructions, which get in the way of the CPU's microcode optimizer. What we are going to do is unroll the loop, so that two iterations of the old loop become one iteration of the new one. Listing 16-18 shows the result.

Listing 16-18. cycle_par_arith.asm
looper:
    mov rax, [rsi]
    mov rdx, [rsi + 8]
    xor rax, 0x1
    xor rdx, 0x1
    add [rdi], rax
    add [rdi + 8], rdx
    add rsi, 16
    add rdi, 16
    sub rcx, 2
    jnz looper


Now that the dependencies are gone, the instructions of the two iterations can be interleaved, and in this order they run faster, because the interleaving improves the simultaneous use of different CPU execution units. Dependent instructions should be placed away from each other, so that other instructions can execute between them.
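The same unrolling idea can be expressed in C: process two elements per iteration so that the two XOR/ADD chains are independent and can overlap in the pipeline. This sketch assumes an even element count, just as the assembly version above does.

```c
#include <assert.h>
#include <stddef.h>

/* C counterpart of Listing 16-18: two independent computations
   per iteration, mirroring the rax and rdx chains. */
static void xor_accumulate(long *dst, const long *src, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        long a = src[i]     ^ 0x1;  /* independent of ...      */
        long b = src[i + 1] ^ 0x1;  /* ... this computation    */
        dst[i]     += a;
        dst[i + 1] += b;
    }
}
```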

■■Question 333  What are the instruction pipeline and superscalar architecture? We cannot tell you which execution units your CPU has, because this is highly model dependent. You have to read the optimization manual for the specific CPU, such as [16]. Additional sources are often helpful; for example, Haswell processors are well described in [17].

16.1.10 Grouping Reads and Writes in Code

The hardware works best with streams of reads and writes that are not interleaved. For this reason, the code shown in Listing 16-19 is typically slower than the code shown in Listing 16-20. The latter has the sequential reads and writes grouped into streams instead of interleaved.

Listing 16-19. rwgroup_bad.asm
mov rax, [rsi]
mov [rdi], rax
mov rax, [rsi+8]
mov [rdi+8], rax
mov rax, [rsi+16]
mov [rdi+16], rax
mov rax, [rsi+24]
mov [rdi+24], rax

Listing 16-20. rwgroup_good.asm
mov rax, [rsi]
mov rbx, [rsi+8]
mov rcx, [rsi+16]
mov rdx, [rsi+24]
mov [rdi], rax
mov [rdi+8], rbx
mov [rdi+16], rcx
mov [rdi+24], rdx
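The grouped pattern of Listing 16-20 expressed in C: read the four values into locals first, then write them all, instead of alternating one read with one write.

```c
#include <assert.h>

/* C counterpart of Listing 16-20: a read stream followed by a
   write stream instead of interleaved read/write pairs. */
static void copy4(long *dst, const long *src) {
    long a = src[0], b = src[1], c = src[2], d = src[3]; /* read stream  */
    dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;      /* write stream */
}
```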

16.2 Caching

16.2.1 How Do We Use Cache Effectively?

Caching is one of the most important mechanisms for increasing performance. We covered the general caching concepts in Chapter 4; this section delves further into how to use caches effectively. We want to start by noting that, contrary to the spirit of the von Neumann architecture, mainstream CPUs have been using separate caches for instructions and data for at least 25 years. Instructions and data almost always reside in different regions of memory, which is why separate caches are more effective. In what follows we are interested in data caching.


By default, all memory operations involve the cache, except for pages marked with special bits that request bypassing it (see Chapter 4). The cache holds small 64-byte blocks of memory called cache lines, aligned on a 64-byte boundary. Cache memory differs from main memory at the circuit level. Each cache line is identified by a tag, the address of the corresponding chunk of memory. Special circuitry makes it possible to fetch a cache line by its address very quickly, but only for small caches (say, a few megabytes per processor); otherwise it becomes too expensive. When trying to read a value from memory, the CPU first tries to read it from the cache. If it is absent there, the relevant chunk of memory is loaded into the cache. This situation is called a cache miss, and it usually has a big impact on program performance. There are usually several levels of cache; each successive level is bigger and slower. The LL cache is the last cache level, closest to main memory. For programs with good locality, caching works well. However, when a piece of code breaks locality, it can make sense to bypass the cache: for example, writing values to a large chunk of memory that will not be accessed any time soon is best done without caching. The CPU tries to predict which memory addresses will be accessed in the near future and preloads the relevant parts of memory into the cache; it favors sequential memory accesses. This gives us two important ground rules for using caches efficiently.

• Ensure good locality.
• Favor sequential memory accesses (and design data structures with this point in mind).
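Both ground rules show up in something as simple as traversing a matrix; the sketch below (with an illustrative size, not from the book) walks it row by row, so that each 64-byte cache line is fully used before the next one is loaded. Swapping the two loops would touch a new line on almost every access.

```c
#include <stddef.h>

/* Sequential, cache-friendly traversal: the inner index walks
   contiguous memory, so each loaded cache line is fully consumed. */
enum { ROWS = 64, COLS = 64 };

static long sum_row_major(long m[ROWS][COLS]) {
    long s = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)  /* contiguous inner loop */
            s += m[r][c];
    return s;
}
```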

16.2.2 Prefetching

It is possible to give the CPU a special hint that a certain memory area will be accessed soon. On Intel 64, this is done with the prefetch instruction. It accepts an address in memory; the CPU will do its best to preload the corresponding data into the cache in the near future. This is used to prevent cache misses. Prefetching can be quite effective, but it must be combined with testing. The prefetch should execute before the data access itself, but not too close to it. Prefetching is performed asynchronously, meaning that it proceeds while subsequent instructions execute. If the prefetch is too close to the data access, the CPU will not have enough time to load the data into the cache, and a cache miss will still occur. It is also very important to understand that "near" and "far" from the data access refer to the position of the instruction on the execution trace. We do not necessarily need to place the prefetch near the access in the program structure (in the same function); we should choose a place that precedes the access on the trace. It can even be located in a completely different module that usually executes before the data access. Of course, this is very bad for code readability, introduces non-obvious dependencies between modules, and is a last-resort technique. To prefetch in C, we can use one of GCC's built-in functions:

void __builtin_prefetch (const void *addr, ...)

It will be replaced by an architecture-specific prefetch instruction. Besides the address, it accepts two optional parameters, which must be integer constants.

1. Are we going to read from this address (0, the default) or write to it (1)?
2. How strong is the temporal locality? From 3 for maximal locality down to 0 for minimal. Zero means the value can be evicted from the cache right after use; 3 means all cache levels should keep it.

The CPU also prefetches on its own when it can predict where the next memory access is likely to occur. While this works well for regular memory accesses such as traversing arrays, it becomes ineffective once the access pattern starts to look random to the predictor.
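A minimal sketch of __builtin_prefetch in a traversal: hint that the data a few iterations ahead will be read soon. The prefetch distance (16 here) is an illustrative assumption and, as the text says, must be tuned by measurement.

```c
#include <stddef.h>

/* Prefetching ahead of a linear scan. The third argument (1) asks
   for low temporal locality: the data is used once and can be
   evicted soon afterward. */
static long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* 0 = read access */
        s += a[i];
    }
    return s;
}
```

For a purely linear scan like this, the hardware prefetcher already does the job; the builtin pays off mainly for patterns the predictor cannot follow, such as the binary search in the next section.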


16.2.3 Example: Binary Search with Prefetching

Let's look at the example shown in Listing 16-21.

Listing 16-21. prefetch_binsearch.c
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#define SIZE 1024*512*16

int binarySearch(int *array, size_t number_of_elements, int key) {
    size_t low = 0, high = number_of_elements-1, mid;
    while (low <= high) {
        mid = (low + high)/2;
        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] == key)
            return mid;
        else if (array[mid] > key)
            high = mid - 1;
    }
    return -1;
}

int main() {
    size_t i = 0;
    int NUM_LOOKUPS = SIZE;
    int *array;
    int *lookups;
    srand(time(NULL));
    array = malloc(SIZE * sizeof(int));
    lookups = malloc(NUM_LOOKUPS * sizeof(int));
    ...
}

> /usr/bin/time -v ./matrix_init_ra
Command being timed: "./matrix_init_ra"
User time (seconds): 2.40
System time (seconds): 1.01
Percent of CPU this job got: 86%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.94
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 889808


Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 2655
Minor (reclaiming a frame) page faults: 275963
Voluntary context switches: 2694
Involuntary context switches: 548
Swaps: 0
File system inputs: 132368
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

> /usr/bin/time -v ./matrix_init_linear
Command being timed: "./matrix_init_linear"
User time (seconds): 0.12
System time (seconds): 0.83
Percent of CPU this job got: 92%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.04
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 900280
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 4
Minor (reclaiming a frame) page faults: 262222
Voluntary context switches: 29
Involuntary context switches: 449
Swaps: 0
File system inputs: 176
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

The random-access version runs much slower because of cache misses, which we can confirm using the valgrind utility with its cachegrind tool, as shown in Listing 16-29.

Listing 16-29. cachegrind_matrix_bad
> valgrind --tool=cachegrind ./matrix_init_ra

==17022== Command: ./matrix_init_ra
--17022-- warning: L3 cache found, using its data for the 'LL' simulation.
==17022==
==17022== I   refs:      268,623,230
==17022== I1  misses:            809


==17022== LLi misses:            804
==17022== I1  miss rate:        0.00%
==17022== LLi miss rate:        0.00%
==17022==
==17022== D   refs:       67,163,682  (40,974 rd + 67,122,708 wr)
==17022== D1  misses:     67,111,793  ( 2,384 rd + 67,109,409 wr)
==17022== LLd misses:     67,111,408  ( 2,034 rd + 67,109,374 wr)
==17022== D1  miss rate:        99.9% (   5.8%   +      100.0%  )
==17022== LLd miss rate:        99.9% (   5.0%   +      100.0%  )
==17022==
==17022== LL refs:        67,112,602  ( 3,193 rd + 67,109,409 wr)
==17022== LL misses:      67,112,212  ( 2,838 rd + 67,109,374 wr)
==17022== LL miss rate:         20.0% (   0.0%   +      100.0%  )

As we can see, accessing memory sequentially radically reduces the number of cache misses:

==17023== Command: ./matrix_init_linear
--17023-- warning: L3 cache found, using its data for the 'LL' simulation.
==17023==
==17023== I   refs:      336,117,093
==17023== I1  misses:            813
==17023== LLi misses:            808
==17023== I1  miss rate:        0.00%
==17023== LLi miss rate:        0.00%
==17023==
==17023== D   refs:       67,163,675  (40,970 rd + 67,122,705 wr)
==17023== D1  misses:     16,780,146  ( 2,384 rd + 16,777,762 wr)
==17023== LLd misses:     16,779,760  ( 2,033 rd + 16,777,727 wr)
==17023== D1  miss rate:        25.0% (   5.8%   +       25.0%  )
==17023== LLd miss rate:        25.0% (   5.0%   +       25.0%  )
==17023==
==17023== LL refs:        16,780,959  ( 3,197 rd + 16,777,762 wr)
==17023== LL misses:      16,780,568  ( 2,841 rd + 16,777,727 wr)
==17023== LL miss rate:          4.2% (   0.0%   +       25.0%  )

■■Question 334  Take a look at the GCC man page, the "Optimization Options" section.

16.3 SIMD Instruction Class

The von Neumann computation model is sequential in nature: it does not assume that any operations can be performed in parallel. Over time, however, it became clear that performing actions in parallel is necessary for better performance. It is possible when computations are independent of one another. For example, to add up a million integers, we could compute the sums of ten blocks of 100,000 numbers on ten processors and then add the results together. This is the typical kind of task that is well solved by the map-reduce technique [5]. We can implement parallel execution in two ways.


• Parallel execution of several instruction sequences. This can be achieved by adding more processor cores. We will discuss multithreaded programming, which makes use of multiple cores, in Chapter 17.
• Parallel execution of the actions required to complete a single instruction. In this case, we can have instructions that trigger several independent computations, which cover different parts of the processor circuitry and so can be exploited in parallel. To implement such instructions, the CPU needs to include multiple ALUs to get a real performance gain, but it does not need to be able to execute multiple instructions concurrently. These are called SIMD (Single Instruction, Multiple Data) instructions. In this section, we will look at the CPU extensions called SSE (Streaming SIMD Extensions) and their newer analogue AVX (Advanced Vector Extensions). Unlike SIMD instructions, most of the instructions we have studied so far are of the SISD (Single Instruction, Single Data) kind.
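The block-summing idea mentioned above can be sketched sequentially in C: split the work into independent partial sums, then combine them. Each partial sum could run on its own core (Chapter 17) or in its own SIMD lane; the block count of 4 here is illustrative.

```c
#include <stddef.h>

/* Map-reduce style summation: independent accumulators ("map"),
   then a final combination of the partial results ("reduce"). */
enum { BLOCKS = 4 };

static long block_sum(const long *a, size_t n) {
    long partial[BLOCKS] = {0};
    for (size_t i = 0; i < n; i++)
        partial[i % BLOCKS] += a[i];  /* independent partial sums */
    long total = 0;
    for (int b = 0; b < BLOCKS; b++)
        total += partial[b];          /* combine the results */
    return total;
}
```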

16.4 SSE and AVX Extensions

SIMD instructions are the basis of the SSE and AVX instruction set extensions. Most of them perform operations on several pairs of data at once; for example, mulps can multiply four pairs of 32-bit floats in one go. However, their scalar counterparts, which operate on a single pair of operands (such as mulss), are now the recommended way to perform all floating-point arithmetic. By default, GCC generates SSE instructions to operate on floating-point numbers. They accept operands in xmm registers or in memory.

■■Note We omit the description of the legacy dedicated floating-point stack for brevity. However, we want to point out that all parts of a program must be translated using the same flavor of floating-point arithmetic: either the floating-point stack or SSE instructions.

We will start with the example shown in Listing 16-30.

Listing 16-30. simd_main.c
#include <stdio.h>

void sse( float[static 4], float[static 4] );

int main(void) {
    float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float y[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    sse( x, y );
    printf( "%f %f %f %f\n", x[0], x[1], x[2], x[3] );
    return 0;
}


In this example, there is an sse function defined elsewhere, which accepts two arrays of floats. Each of them must be at least four elements long. This function performs computations and modifies the first array.

We call values packed if they fill an xmm register with consecutive memory cells of the same size. In Listing 16-30, float x[4] holds four packed single precision float values.

We will define the sse function in the assembly file shown in Listing 16-31.

Listing 16-31. simd_asm.asm
section .text
global sse

; rdi = x, rsi = y
sse:
movdqa xmm0, [rdi]
mulps  xmm0, [rsi]
addps  xmm0, [rsi]
movdqa [rdi], xmm0
ret

This file defines the sse function. It executes four SSE instructions:

• movdqa (MOVe Double Qword Aligned) copies 16 bytes of memory, pointed to by rdi, into the register xmm0. We saw this instruction in section 14.1.1.
• mulps (MULtiply Packed Single precision floating-point values) multiplies the contents of xmm0 by the four consecutive float values stored in memory at the address taken from rsi.
• addps (ADD Packed Single precision floating-point values) adds the four consecutive float values stored in memory at the address taken from rsi to the contents of xmm0.
• movdqa copies xmm0 back into the memory pointed to by rdi.

In other words, four pairs of floats are multiplied, and then the second float of each pair is added to the product. The naming pattern is common: the operation semantics (mov, add, mul, …) followed by suffixes. The first suffix is either P (packed) or S (scalar, for single values). The second is D for double precision values (double in C) or S for single precision values (float in C). We want to emphasize again that most SSE instructions accept only aligned memory operands.

To complete the task, you will need to study the documentation for the following instructions using the Intel Software Developer's Manual [15]:

• movsd: move scalar double precision floating-point value.
• movdqa: move aligned double quadword.
• movdqu: move unaligned double quadword.
• mulps: multiply packed single precision floating-point values.
• mulpd: multiply packed double precision floating-point values.
• addps: add packed single precision floating-point values.
• haddps: packed single precision floating-point horizontal add.


• shufps: shuffle packed single precision floating-point values.
• unpcklps: unpack and interleave low packed single precision floating-point values.
• packsswb: pack with signed saturation.
• cvtdq2pd: convert packed doubleword integers to packed double precision floating-point values.

These instructions are part of the SSE extensions. Intel has introduced a newer extension called AVX, which adds the registers ymm0, ymm1, …, ymm15. They are 256 bits wide; their least significant 128 bits (the lower half) can be accessed as the old xmm registers. The new instructions are usually prefixed with v, for example, vbroadcastss.

It is important to understand that if your CPU supports AVX instructions, it does not mean that they are faster than SSE! Processors of the same family are differentiated not by the instruction sets they support but by the amount of circuitry. Cheaper processors are likely to have fewer ALUs (Arithmetic Logic Units). Take mulps with ymm registers as an example: it multiplies eight pairs of floats. The best CPUs will have enough ALUs to multiply all eight pairs simultaneously. Cheaper CPUs might have only, say, four suitable ALUs, so on the microcode level they will have to use them twice: first multiplying four pairs, then the other four. The programmer will not notice the difference in the instruction semantics, but he will notice it in performance: a single AVX mulps on ymm registers, processing eight pairs of floats, can even be slower than two SSE mulps on xmm registers, processing four pairs each!

16.4.1 Task: Sepia Filter

In this task, we will create a program that applies the sepia filter to an image. The sepia filter makes a vividly colored image look like an old photograph. Most graphical editors include a sepia filter.

The filter itself is not difficult to code. It recalculates the red, green, and blue components of each pixel based on the old red, green, and blue values. Mathematically, if we think of a pixel as a three-dimensional vector, the transformation is nothing more than a multiplication of a matrix by a vector. Let the new pixel value be (B G R)^T (where the superscript T denotes transposition); B, G, and R denote the blue, green, and red levels. In vector form, the transformation can be written as follows:

( B )   ( c11 c12 c13 )   ( b )
( G ) = ( c21 c22 c23 ) x ( g )
( R )   ( c31 c32 c33 )   ( r )

In scalar form, we can rewrite it as

B = b*c11 + g*c12 + r*c13
G = b*c21 + g*c22 + r*c23
R = b*c31 + g*c32 + r*c33

In the task given in section 13.10, we coded a program to rotate an image. If you have thought its architecture through well, it will be easy to reuse most of that code.

We will have to use saturation arithmetic. It means that all operations, such as addition and multiplication, are limited to a fixed range between a minimum and a maximum value. Our usual machine arithmetic is modular: if the result exceeds the maximum value, we wrap around the range. For example, for unsigned chars, 200 + 100 = 300 mod 256 = 44. Saturation arithmetic implies that for the same range between 0 and 255, 200 + 100 = 255, because 255 is the maximum value in the range.


C does not implement this kind of arithmetic, so we will have to check for overflow manually. SSE contains instructions to convert floating-point values into saturated single-byte integers.

Performing the transformation in C is easy. It requires straightforward coding of the matrix-by-vector multiplication while taking saturation into account. Listing 16-32 shows the code.

Listing 16-32. image_sepia_c_example.c
#include <stdint.h>

struct pixel { uint8_t b, g, r; };

struct image {
    uint32_t width, height;
    struct pixel* array;
};

static unsigned char sat( uint64_t x ) {
    if (x < 256) return x;
    return 255;
}

static void sepia_one( struct pixel* const pixel ) {
    static const float c[3][3] = {
        { .393f, .769f, .189f },
        { .349f, .686f, .168f },
        { .272f, .543f, .131f } };
    struct pixel const old = *pixel;
    pixel->r = sat( old.r * c[0][0] + old.g * c[0][1] + old.b * c[0][2] );
    pixel->g = sat( old.r * c[1][0] + old.g * c[1][1] + old.b * c[1][2] );
    pixel->b = sat( old.r * c[2][0] + old.g * c[2][1] + old.b * c[2][2] );
}

void sepia_c_inplace( struct image* img ) {
    uint32_t x, y;
    for( y = 0; y < img->height; y++ )
        for( x = 0; x < img->width; x++ )
            sepia_one( pixel_of( *img, x, y ) );
}

Note that using uint8_t or unsigned char here is crucial. In this task you should

• Implement, in a separate assembly file, a routine that applies the filter to the bulk of the image (except, possibly, the last few pixels). It should operate on chunks of several pixels at once using SSE instructions.


The last few pixels that did not fill the final chunk can be processed one by one using the C code given in Listing 16-32.
• Verify that the C and assembly versions produce similar results.
• Compile two programs: the first should use the naive C approach, the second should use SSE instructions.
• Compare the running times of the C and SSE versions using a huge image as input (preferably hundreds of megabytes).
• Repeat the comparison several times and average the values for SSE and for C.

To make the difference noticeable, we have to perform as many operations in parallel as we can. Each pixel consists of 3 bytes; after its components are converted to floats, it will occupy 12 bytes. Each xmm register is 16 bytes wide. To be efficient, we have to make use of the last 4 bytes as well. To achieve that, we process the image in 48-byte chunks, which correspond to three xmm registers and hold the 12 color components of four pixels. Let the subscript denote the index of a pixel. The image layout looks like this:

b1 g1 r1 b2 g2 r2 b3 g3 r3 b4 g4 r4 ...

We would like to compute the first four components. Three of them belong to the first pixel; the fourth belongs to the second one. To perform the necessary transformations, it is useful to first place the following values into registers:

xmm0 = b1 b1 b1 b2
xmm1 = g1 g1 g1 g2
xmm2 = r1 r1 r1 r2

We can store the matrix coefficients in xmm registers or in memory, but it is important to store its columns, not its rows. To demonstrate the algorithm, we will use the following initial values:

xmm3 = c11 | c21 | c31 | c11
xmm4 = c12 | c22 | c32 | c12
xmm5 = c13 | c23 | c33 | c13

We use mulps to multiply these packed values by xmm0 ... xmm2:

xmm3 = b1c11 | b1c21 | b1c31 | b2c11
xmm4 = g1c12 | g1c22 | g1c32 | g2c12
xmm5 = r1c13 | r1c23 | r1c33 | r2c13

The next step is to add them together using addps instructions. Similar actions should be performed for the two other 16-byte parts of the chunk, which contain g2 r2 b3 g3 and r3 b4 g4 r4.
This technique of using the transposed coefficient matrix allows us to do without horizontal addition instructions such as haddps. It is described in detail in [19].

To measure time, use getrusage(RUSAGE_SELF, &r) (read the getrusage man pages first). It fills a variable r of type struct rusage, whose field r.ru_utime has the type struct timeval. The latter contains, in turn, a pair of values: seconds elapsed and microseconds elapsed. By comparing these values before and after the transformation, we can deduce the time spent on it.


Listing 16-33 shows an example of a single time measurement.

Listing 16-33. run_time.c
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void) {
    struct rusage r;
    struct timeval start;
    struct timeval end;

    getrusage(RUSAGE_SELF, &r);
    start = r.ru_utime;
    for( uint64_t i = 0; i < 100000000; i++ );
    getrusage(RUSAGE_SELF, &r);
    end = r.ru_utime;

    long res = ((end.tv_sec - start.tv_sec) * 1000000L)
             + end.tv_usec - start.tv_usec;
    printf("Elapsed time in microseconds: %ld\n", res);
    return 0;
}

Use a lookup table to convert an unsigned char into a float quickly:

float const byte_to_float[] = { 0.0f, 1.0f, 2.0f, ..., 255.0f };

■■Question 335 Read about methods for calculating the confidence interval and calculate the 95% confidence interval for a reasonably large number of measurements.

16.5 Summary In this chapter, we talked about compiler optimizations and why they are necessary. We've seen how far optimized translated code can go from its initial version. We then studied how to get the most benefit from caching and how to parallelize instruction-level floating-point calculations using SSE instructions. In the next chapter, we'll see how to parallelize the execution of sequences of instructions, create multiple threads, and change our view of memory in the presence of multithreading.


■■Question 336 Which GCC options control optimizations globally?
■■Question 337 Which kinds of optimizations can potentially bring the most benefit?
■■Question 338 What advantages and disadvantages can the omission of a frame pointer bring?
■■Question 339 How does a tail recursive function differ from an ordinary recursive function?
■■Question 340 Can any recursive function be rewritten in a tail recursive way without using additional data structures?
■■Question 341 What is common subexpression elimination? How does it affect the way we write code?
■■Question 342 What is constant propagation?
■■Question 343 Why should we mark functions static whenever possible to help compiler optimizations?
■■Question 344 What benefits does the named return value optimization bring?
■■Question 345 What is branch prediction?
■■Question 346 What are the branch target buffer, global history, and local history tables?
■■Question 347 Read the notes on branch prediction for your CPU in [16].
■■Question 348 What is an execution unit, and why do we care about them?
■■Question 349 How are the speed of AVX instructions and the number of execution units related?
■■Question 350 Which memory access patterns are good?
■■Question 351 Why do we have multiple cache levels?
■■Question 352 In which cases can prefetching lead to performance improvement, and why?
■■Question 353 What are SSE instructions used for?
■■Question 354 Why do most SSE instructions require aligned operands?
■■Question 355 How do we copy data between general purpose registers and xmm registers?
■■Question 356 In which cases is it worth using SIMD instructions?


CHAPTER 17

Multithreading

In this chapter, we will explore the multithreading capabilities provided by the C language. Multithreading is a topic for a book of its own, so we will concentrate on the language features and the relevant properties of the abstract machine rather than on the practices of multithreaded programming and questions of program architecture. Until C11, multithreading support was external to the language itself, provided by non-standard libraries and tricks. A part of it (atomics) is now implemented in many compilers and provides a standard-compliant way of writing multithreaded applications. Unfortunately, as of today, the threading support itself is not implemented in most toolchains, so we will use the pthreads library to write the example code. We will still use the standard-compliant atomics. This chapter is by no means an exhaustive guide to multithreaded programming, which is a beast worthy of a dedicated book, but it will introduce you to the most important concepts and the relevant language features. If you want to master it, we recommend a lot of practice, expert articles, books such as [34], and code reviews from your more experienced colleagues.

17.1 Processes and Threads

It is important to understand the difference between two key notions involved in most discussions of multithreading: threads and processes.

A process is a resource container that collects all kinds of runtime information and resources a program needs to execute. A process contains the following:

• An address space, partially filled with the executable code, data, shared libraries, other mapped files, and so on. Parts of it can be shared with other processes.
• All other kinds of associated state, such as open file descriptors, locks, and so on.
• Information such as process ID, process group ID, user ID, group ID.
• Other resources used for interprocess communication: pipes, semaphores, message queues, and so on.

A thread is a stream of instructions that the operating system can schedule for execution. The operating system does not schedule processes, but threads. Each thread lives as a part of a process and owns the following state:

• Registers.
• Stack (technically, it is defined by the value of the stack pointer register; however, as all threads of a process share the same address space, any of them can access the stacks of other threads, although that is rarely a good idea).

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_17


Chapter 17 ■ Multithreading

• Important properties for the scheduler, such as priority.
• Pending and blocked signals.
• Signal mask.

When a process exits, all of its associated resources are released, including all of its threads, open file descriptors, and so on.

17.2 What Makes Multithreading Hard?

Multithreading allows us to use multiple processor cores (or multiple processors) to run threads at the same time. For example, if one thread is reading a file from disk (which is a very slow operation), another can use the pause to perform CPU-heavy computations, spreading the CPU (central processing unit) load more evenly over time. So a program might become faster if it can take advantage of multithreading.

Threads usually have to work with the same data. As long as the data is not modified by any of them, there is no problem: reading data has no effect on the execution of other threads. However, if the shared data is being modified by one (or more) threads, we face a number of problems, such as the following:

• When does thread A see the changes made by thread B?
• In which order do the threads modify the data? (As we saw in Chapter 16, instructions can be reordered for optimization purposes.)
• How can we perform operations on complex data without interfering with other threads?

When these problems are not solved correctly, a very nasty kind of bug appears, one that is hard to catch (because it shows up only occasionally, when the instructions of different threads are executed in a specific, unfortunate order). Let us try to build an understanding of these problems and study how they can be solved.

17.3 Order of Execution

When we started studying the C abstract machine, we got used to thinking that the sequence of C statements corresponds to the actions the machine performs, compiled into instructions in, naturally, the same order. Now is the time to dive into the pragmatic details of why this is really not the case.

We tend to describe algorithms in the way that is easiest to understand, and that is almost always a good thing. However, the order given by the programmer is not always optimal performance-wise. For example, the compiler might want to improve locality without changing the semantics of the code. Listing 17-1 shows an example.

Listing 17-1. ex_locality_src.c
char x[1000000], y[1000000];
...
x[4] = y[4];
x[10004] = y[10004];
x[5] = y[5];
x[10005] = y[10005];


Listing 17-2 shows a possible translation result.

Listing 17-2. ex_locality_asm1.asm
mov al, [rsi + 4]
mov [rdi + 4], al
mov al, [rsi + 10004]
mov [rdi + 10004], al
mov al, [rsi + 5]
mov [rdi + 5], al
mov al, [rsi + 10005]
mov [rdi + 10005], al

However, this code can obviously be rewritten to ensure better locality, that is, to first assign x[4] and x[5] and then x[10004] and x[10005], as shown in Listing 17-3.

Listing 17-3. ex_locality_asm2.asm
mov al, [rsi + 4]
mov [rdi + 4], al
mov al, [rsi + 5]
mov [rdi + 5], al
mov al, [rsi + 10004]
mov [rdi + 10004], al
mov al, [rsi + 10005]
mov [rdi + 10005], al

The effects of these two instruction sequences are the same if the abstract machine has only one CPU: given any initial machine state, the resulting states after their execution will be equal. The second translation usually works faster, so the compiler might prefer it. This is a simple case of memory reordering: a situation where memory accesses are reordered relative to the source code. For single-threaded applications, which run "truly sequentially," we can usually expect the order of operations to be irrelevant as long as the observable behavior does not change. This freedom ends as soon as threads start to communicate. Most inexperienced programmers do not think much about it because they limit themselves to single-threaded programming. Nowadays, we can no longer avoid thinking about parallelism because of how widespread it is, and because it is often the only thing that can really improve program performance. Therefore, in this chapter, we will talk about memory reordering and how to constrain it correctly.

17.4 Strong and Weak Memory Models

Memory reorderings can be performed by the compiler (as shown above) or by the processor itself. Typically both occur, and we will be interested in both; a uniform classification can be made for all of them. A memory model tells us which kinds of reorderings of load and store instructions to expect. Most of the time we are not interested in the exact instructions used to access memory (mov, movq, etc.); we only care whether a memory read or write occurs.


There are two extreme poles: weak and strong memory models. As with strong and weak typing, most existing models fall somewhere in between, closer to one or the other. We have found the classification by Jeff Preshing [31] useful and will stick to it in this book. According to it, memory models can be divided into four categories, listed from the most relaxed to the strongest.

1. Really weak. In these models, any kind of memory reordering can happen (as long as the observable behavior of a single-threaded program does not change, of course).

2. Weak with data dependency ordering (such as the ARM v7 hardware memory model). Here we speak about a particular kind of data dependency: the one existing between loads. It occurs when we need to fetch an address from memory and then use it to perform another load, for example:

mov rdx, [rbx]
mov rax, [rdx]

In C, this is the situation when we use the -> operator to access a field of a structure through a pointer to that structure. Really weak memory models do not even guarantee data dependency ordering.

3. Usually strong (such as the Intel 64 hardware memory model). It means that there is a guarantee that all stores are performed in the same order as they are issued in the code. Some loads, however, can be moved around. Intel 64 generally falls into this category.

4. Sequentially consistent. This can be described as what you see when stepping through an unoptimized program in a debugger on a single processor core. Memory reordering never occurs.

17.5 Reordering Example

Listing 17-4 shows an example of a situation in which memory reordering can give us a hard time. Two threads execute the instructions contained in the functions thread1 and thread2, respectively.

Listing 17-4. mem_reorder_sample.c
int x = 0;
int y = 0;

void thread1(void) {
    x = 1;
    print(y);
}

void thread2(void) {
    y = 1;
    print(x);
}


Both threads share the variables x and y. One of them stores to x and then loads the value of y, while the other does the same with y and x. We are interested in two kinds of memory accesses: loads and stores; in our examples, we will often omit all other actions for simplicity. As these two instructions are completely independent (they operate on different data), they can be reordered inside each thread without changing the observable behavior, giving us four variants: store + load or load + store for each of the two threads. That is what a compiler is allowed to do for its own reasons.

For each variant, there are six possible execution orders. They represent how the two threads progress in time relative to each other. We show them as sequences of 1s and 2s: if the first thread made a step, we write 1; if the second did, we write 2.

1. 1-1-2-2
2. 1-2-1-2
3. 2-1-1-2
4. 2-1-2-1
5. 2-2-1-1
6. 1-2-2-1

For example, 1-1-2-2 means that the first thread performed two steps, and then the second thread performed two steps. Each sequence corresponds to four different scenarios. For example, the sequence 1-2-1-2 encodes one of the traces shown in Table 17-1.

Table 17-1. Possible instruction execution sequences if the threads take turns as 1-2-1-2

THREAD ID   TRACE 1   TRACE 2   TRACE 3   TRACE 4
1           store x   store x   load y    load y
2           store y   load x    store y   load x
1           load y    load y    store x   store x
2           load x    store y   load x    store y

If we look at the possible traces for each execution order, we get 24 scenarios (some of which are equivalent). As you can see, even for small examples these numbers can get quite big. Anyway, we do not need all possible traces: for each variable, we are only interested in the position of its load relative to its store. Even in Table 17-1, many combinations occur: both x and y can be either stored and then loaded, or loaded and then stored. Naturally, the result of a load depends on whether a store has happened before it. If reorderings were not in play, we would be constrained: each of the two loads would have to be preceded by the store in its own thread, because that is how the source code is written; scheduling the instructions differently cannot change that. However, because reordering exists, we can sometimes get an interesting result: if both threads have their instructions reordered, we get the situation shown in Listing 17-5.

Listing 17-5. mem_reorder_sample_happened.c
int x = 0;
int y = 0;

void thread1(void) {
    print(y);
    x = 1;
}


void thread2(void) {
    print(x);
    y = 1;
}

If a 1-2-*-* schedule is chosen (where * denotes either of the threads), we execute load x and load y first, which makes both of them appear equal to 0 to everyone using the results of these loads. This is possible if the compiler reorders these operations; but even if the compiler is carefully controlled or its reordering disabled, memory reordering performed by the CPU can still produce this effect. The example demonstrates that the output of such a program is highly unpredictable. Later we will study how to constrain the reordering performed by the compiler and by the CPU; we will also provide code that demonstrates hardware reordering in action.

17.6 What Is Volatile and What Is Not

The C memory model we are using is quite weak. Consider the following code:

int x, y;
x = 1;
y = 2;
x = 3;

As we have seen, instructions can be reordered by the compiler. Furthermore, the compiler can deduce that the first assignment is dead code, because it is followed by another assignment to the same variable x. As it is useless, the compiler might even delete this statement entirely.

The volatile keyword solves this problem. It forces the compiler to never optimize away reads and writes of the variable, and it also suppresses any reordering of accesses to it. However, it applies these restrictions to a single variable only and gives no guarantee about the order in which writes to different volatile variables are issued. For example, in the code above, even changing the type of x and y to volatile int will impose an order on the assignments to each of them, but will still allow us to interleave them freely, like this:

volatile int x, y;
x = 1;
x = 3;
y = 2;

Or like this:

volatile int x, y;
y = 2;
x = 1;
x = 3;

Obviously, these guarantees are not enough for multithreaded applications. You cannot use volatile variables to organize access to shared data, because accesses to different variables can be freely moved around one another.


To safely access shared data, we need two guarantees.

• The read or write actually happens. The compiler could have cached the value in a register and never written it back to memory. This is the guarantee volatile provides; it is sufficient for memory-mapped I/O (input/output) but not for multithreaded applications.
• No memory reordering is performed. Imagine that we use a volatile variable as a flag indicating that some data is ready to be read. The code prepares the data and then sets the flag; however, reordering can move the flag assignment to a point before the data is fully prepared. Both hardware and compiler reorderings matter here. This guarantee is not provided by volatile variables.

In this chapter, we study two mechanisms that do provide both guarantees:

• Memory barriers.
• Atomic variables, introduced in C11.

Volatile variables themselves are rarely needed. They suppress optimizations, which is usually not what we want.

17.7 Memory Barriers

A memory barrier is a special instruction or statement that imposes restrictions on how reorderings can be performed. As we saw in Chapter 16, compilers and hardware use many tricks to improve average-case performance, including reordering, lazy memory operations, speculative loads and branch prediction, caching variables in registers, and so on. Constraining them is vital when we need to be sure that certain operations have already been performed, because another thread's logic depends on it. In this section, we want to introduce the different types of memory barriers and give an overview of their possible implementations on Intel 64.

An example of a memory barrier that prevents reordering by the compiler is the following GCC directive:

asm volatile( "" : : : "memory" )

The asm directive is used to include inline assembly code directly into C programs. The volatile keyword, together with the "memory" clobber argument, states that this (empty) piece of inline assembly cannot be optimized away or moved and that it performs memory reads and/or writes. Because of that, the compiler is forced to emit code that commits all pending operations to memory (e.g., storing the values of local variables cached in registers). This does not prevent the processor from performing speculative reads past this statement, so it is not a memory barrier for the processor itself. Obviously, compiler and CPU memory barriers are costly because they prevent optimizations; that is why we do not want to place them after every statement.

There are several types of memory barriers. We will talk about the ones defined in the Linux kernel documentation, but this classification is applicable in most situations.

1. Write memory barrier. It guarantees that all store operations specified in the code before the barrier will appear to happen before all store operations specified after the barrier. GCC uses asm volatile( "" ::: "memory" ) as a general compiler barrier; Intel 64 provides the sfence instruction.


2. Read memory barrier. Likewise, it guarantees that all load operations specified in the code before the barrier appear to happen before all load operations specified after the barrier. It is a stronger form of the data dependency barrier. GCC uses asm volatile( "" ::: "memory" ) as a general compiler barrier; Intel 64 provides the lfence instruction.

3. Data dependency barrier. The data dependency barrier considers only dependent loads, described in section 17.4; thus, it can be viewed as a weaker form of the read memory barrier. No guarantees are given about independent loads or any kind of stores.

4. General memory barrier. This is the strongest barrier: it forces all memory operations specified in the code before the barrier to appear to happen before all memory operations specified after it. GCC uses asm volatile( "" ::: "memory" ) as a general compiler barrier; Intel 64 provides the mfence instruction.

5. Acquire operations. This is a class of operations united by a property called acquire semantics. An operation has acquire semantics if it reads from shared memory and is guaranteed not to be reordered with the reads and writes that follow it in the source code. In other words, it resembles a general memory barrier in that the code following it will not be reordered to execute before it.

6. Release operations. Release semantics is the symmetric property: an operation has it if it writes to shared memory and is guaranteed not to be reordered with the reads and writes that precede it in the source code. It also resembles a general memory barrier, but it still allows later operations to be reordered to a position before the release operation.

Acquire and release operations are thus one-way barriers against certain kinds of reordering.
The following is an example of a single mfence instruction, embedded via GCC inline assembly:

asm( "mfence" )

Combined with the compiler barrier, we get a line that prevents compiler reordering and also acts as a full hardware memory barrier:

asm volatile( "mfence" ::: "memory" )

Any call of a function whose definition is not available in the current translation unit, and which is not an intrinsic (a cross-platform substitute for a specific assembly instruction), acts as a compiler memory barrier.


Chapter 17 ■ Multithreading

17.8 Introduction to pthreads
POSIX Threads (pthreads) is a standard that describes a certain model of program execution. It provides the means to run code in parallel and control its execution. It is implemented as the pthreads library, which we will use throughout this chapter. The library contains C types, constants, and procedures (prefixed with pthread_). Its declarations are available in the pthread.h header. The functions it offers fall into one of the following groups:
• Basic thread management (creation, destruction).
• Mutex management.
• Condition variables.
• Synchronization using locks and barriers.
In this section we will study several examples to familiarize ourselves with pthreads. To perform multithreaded computations, you have the following two options:
• Spawn multiple threads in the same process. Threads share the same address space, so exchanging data between them is relatively easy and fast. When the process ends, all of its threads end too.
• Spawn multiple processes; each of them has its own default thread. These threads communicate through mechanisms provided by the operating system (such as pipes). This is not as fast; also, spawning a process is slower than spawning just a thread, because more operating system structures (and a separate address space) have to be created. Interprocess communication usually involves one or more (sometimes implicit) copy operations. However, separating program logic into isolated processes can have a positive impact on security and robustness, since each process sees only what the other processes explicitly expose.
Pthreads lets you spawn multiple threads in a single process, which is what you normally want to do.

17.8.1 When to Use Multithreading
Multithreading is sometimes convenient for the program logic. For example, you normally should not accept network packets and draw the GUI in the same thread. The GUI must react to user actions (button clicks) and constantly redraw itself (e.g., when the corresponding window is covered by another window and then uncovered). A blocking network operation, however, would stall its thread until it completes. It is therefore convenient to split these actions into different threads, so that they execute seemingly simultaneously. Multithreading can also improve performance, but not in all cases. There are CPU-bound tasks and I/O-bound tasks.
• CPU-bound code can be sped up by giving it more CPU time. It spends most of its time doing computations, not reading data from disk or communicating with devices.
• I/O-bound code cannot be sped up with more CPU time, because it is slowed down by interaction with memory or external devices.


Using multithreading to speed up CPU-bound programs can be beneficial. A common pattern is to use a queue of requests served by worker threads taken from a thread pool: a pool of already created threads that are either working or waiting for work, and are not recreated each time they are needed. See Chapter 7 of [23] for more details. As for how many threads we need, there is no universal recipe. Creating threads, switching between them, and scheduling them all incur overhead, which can slow down the entire program if there is not much work for the threads to do. For computation-heavy tasks, some people recommend spawning n − 1 threads, where n is the total number of processor cores. For tasks that are sequential by their very nature (where each step directly depends on the previous one), spawning multiple threads will not help. What we recommend is to always experiment with the number of threads under different workloads to find out which number best suits the task at hand.

17.8.2 Creating Threads
Creating threads is easy. Listing 17-6 shows an example.

Listing 17-6. pthread_create_mwe.c

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void* threadimpl( void* arg ) {
    for( int i = 0; i < 10; i++ ) {
        puts( arg );
        sleep( 1 );
    }
    return NULL;
}

int main( void ) {
    pthread_t t1, t2;
    pthread_create( &t1, NULL, threadimpl, "fizz" );
    pthread_create( &t2, NULL, threadimpl, "buzz" );
    pthread_exit( NULL );
    puts( "bye" );
    return 0;
}

Note that code that uses the pthread library must be compiled with the -pthread flag, for example,

> gcc -O3 -pthread main.c

Specifying -lpthread instead will not give us the expected result.1 Linking against libpthread alone is not enough: several preprocessor options (e.g., _REENTRANT) are enabled by -pthread. Therefore, whenever the -pthread option is available, use it.

1. This option is documented as platform-specific, so it may not be available on some platforms.


Initially, there is only one thread, which starts executing the main function. An instance of type pthread_t stores all the information about some other thread, so that we can control that thread using the instance as an identifier. Threads are initialized with the pthread_create function, which has the following signature:

int pthread_create(
    pthread_t *thread,
    const pthread_attr_t *attr,
    void *(*start_routine) (void *),
    void *arg );

The first argument is a pointer to the pthread_t instance to be initialized. The second is a collection of attributes, which we will cover later; for now, it is safe to pass NULL. The thread start function must accept a void* pointer as its argument and return a pointer. Only one argument is allowed; however, you can easily create a structure or an array encapsulating multiple arguments and pass a pointer to it. The return value of start_routine is also a pointer and can be used to return the result of the thread's work.2 The last argument is the pointer that will actually be passed to start_routine. In our example, both threads are implemented in the same way: each accepts a pointer (to a string) and then prints the string repeatedly at an interval of roughly one second. The sleep function, declared in unistd.h, suspends the current thread for the specified number of seconds. After ten iterations, the thread returns, which is equivalent to calling the pthread_exit function with the return value as its argument. The return value is usually the result of the computation performed by the thread; return NULL if it is not needed. Later we will see how this value can be obtained from the main thread.

■■Cast to void  Constructs like (void)argc have only one purpose: to suppress warnings about the unused variable or argument argc. You will sometimes find them in source code.

However, naively returning from the main function leads to process termination. What if other threads are still alive? The main thread must wait for them to finish first! This is what pthread_exit does when called in the main thread: it waits for all other threads to finish and only then exits the program. None of the code following it will run, so you will not see the farewell message on standard output. This program will print fizz and buzz lines in random order, ten of each, and then exit. It is impossible to predict whether the first or the second thread will be scheduled first, so the order might be different each time. Listing 17-7 shows an exemplary result.

Listing 17-7. pthread_create_mwe_out

> ./main
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz
fizz
buzz

2. Remember not to return the address of a local variable!


fizz
buzz

As you can see, the bye string is never printed, because the corresponding puts call comes after the pthread_exit call.

■■Where are the arguments?  It is important that the pointer argument passed to a thread point to data that stays alive until the thread terminates. Passing a pointer to a stack-allocated variable can be risky: once the stack frame of the enclosing function is destroyed, accessing the deallocated variable results in undefined behavior. Unless the arguments are constant (or you intend to use them for synchronization purposes), do not pass the same argument to different threads.

In the example shown in Listing 17-6, the strings that threadimpl accepts are allocated in global read-only memory (.rodata), so passing pointers to them is safe.

The maximum number of threads that can be spawned is implementation dependent. On Linux, for example, you can use ulimit -a to get the relevant information. Threads can create other threads; there is no limitation on that. In fact, the pthreads implementation ensures that a call to pthread_create acts as a full compiler memory barrier as well as a full hardware memory barrier.

pthread_attr_init is used to initialize an instance of the opaque type pthread_attr_t (implemented as an incomplete type). Attributes provide additional parameters for threads, such as stack size or stack address. The following functions are used to set attributes:
• pthread_attr_setaffinity_np – The thread will prefer to run on a specific CPU core.
• pthread_attr_setdetachstate – Will we be able to call pthread_join on this thread, or will it be detached (as opposed to joinable)? The purpose of pthread_join is explained in the next section.
• pthread_attr_setguardsize – Places a region of forbidden addresses of a certain size past the stack boundary, to detect stack overflows.
• pthread_attr_setinheritsched – Are the two following parameters inherited from the parent thread (where the creation took place), or are they taken from the attributes themselves?
• pthread_attr_setschedparam – Currently this is only the scheduling priority, but additional parameters may be added in the future.


• pthread_attr_setschedpolicy – How the scheduling will be done. The scheduling policies and their descriptions can be found in man sched.
• pthread_attr_setscope – Refers to the contention scope, which defines the set of threads against which this thread will compete for the CPU (or other resources).
• pthread_attr_setstackaddr – Where will the stack be located?
• pthread_attr_setstacksize – How big will the thread's stack be?
• pthread_attr_setstack – Sets both the stack address and size.
They all have "get" counterparts (e.g., pthread_attr_getscope).

■■Question 357 Read the man pages for the functions listed above. ■■Question 358  What will sysconf(_SC_NPROCESSORS_ONLN) return?

17.8.3 Managing Threads
What we have learned so far is enough to distribute work across threads. However, we still have no means of synchronization, so having distributed the work, we cannot actually use the results of one thread's computation in other threads. The simplest form of synchronization is thread joining. The idea is simple: calling pthread_join on an instance of pthread_t puts the current thread into a waiting state until the other thread finishes. Listing 17-8 shows an example.

Listing 17-8. thread_join_mwe.c

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void* worker( void* param ) {
    for( int i = 0; i < 3; i++ ) {
        puts( (const char*) param );
        sleep( 1 );
    }
    return (void*) "done";
}

int main( void ) {
    pthread_t t;
    void* result;
    pthread_create( &t, NULL, worker, (void*) "I am a worker!" );
    pthread_join( t, &result );
    puts( (const char*) result );
    return 0;
}


pthread_join accepts two arguments: the thread itself and the address of a void* variable, which will be initialized with the result of the thread's execution. The thread join acts as a full barrier, because no reads or writes scheduled to occur after the join may be moved before it. By default, threads are created joinable, but a thread can also be created detached. This has a certain benefit: the resources of a detached thread can be released immediately upon its termination, whereas a joinable thread has to be joined before its resources can be released. To create a detached thread,
• Create an instance of the attributes pthread_attr_t attr;
• Initialize it with pthread_attr_init( &attr );
• Call pthread_attr_setdetachstate( &attr, PTHREAD_CREATE_DETACHED ); and
• Create the thread using pthread_create with &attr as the attribute argument.
A thread can be explicitly changed from joinable to detached by calling pthread_detach(). The reverse is impossible.

17.8.4 Example: Distributed Factorization
We choose a simple CPU-bound task: counting the factors of a number. First, let's solve it using the most trivial brute-force method on a single core. Listing 17-9 shows the code.

Listing 17-9. dist_fact_sp.c

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t factors( uint64_t num ) {
    uint64_t result = 0;
    for ( uint64_t i = 1; i <= num; i++ )
        if ( num % i == 0 ) result++;
    return result;
}

int main( void ) {
    uint64_t input = 2000000000;
    printf( "Factors of %" PRIu64 ": %" PRIu64 "\n", input, factors( input ) );
    return 0;
}

This code can be compiled with the following command:

> gcc -O3 -std=c99 -o fact_sp dist_fact_sp.c


We will start the parallelization with a simplified version of the multithreaded code, which always performs the computation in two threads and is not architecturally pretty. Listing 17-10 shows it.

Listing 17-10. dist_fact_mp_simple.c

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

uint64_t input = 0;
uint64_t result1 = 0;
uint64_t result2 = 0;

void* fact_worker1( void* arg ) {
    result1 = 0;
    for( uint64_t i = 1; i < input/2; i++ )
        if ( input % i == 0 ) result1++;
    return NULL;
}

void* fact_worker2( void* arg ) {
    result2 = 0;
    for( uint64_t i = input/2; i <= input; i++ )
        if ( input % i == 0 ) result2++;
    return NULL;
}

int main( void ) {
    input = 2000000000;
    pthread_t t1, t2;
    pthread_create( &t1, NULL, fact_worker1, NULL );
    pthread_create( &t2, NULL, fact_worker2, NULL );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    printf( "Factors of %" PRIu64 ": %" PRIu64 "\n", input, result1 + result2 );
    return 0;
}

Listing 17-11 shows a generalized version, which distributes the work across an arbitrary number of threads.

Listing 17-11. dist_fact_mp.c

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 4

struct fact_task {
    uint64_t num;
    uint64_t from, to;
    uint64_t result;
};

void* fact_worker( void* arg ) {
    struct fact_task* const task = arg;
    task->result = 0;
    for( uint64_t i = task->from; i < task->to; i++ )
        if ( task->num % i == 0 ) task->result++;
    return NULL;
}

/* assuming threads_count < num */
uint64_t factors_mp( uint64_t num, size_t threads_count ) {
    struct fact_task* tasks = malloc( threads_count * sizeof( *tasks ) );
    pthread_t* threads = malloc( threads_count * sizeof( *threads ) );

    uint64_t start = 1;
    size_t step = num / threads_count;
    for( size_t i = 0; i < threads_count; i++ ) {
        tasks[i].num = num;
        tasks[i].from = start;
        tasks[i].to = start + step;
        start += step;
    }
    tasks[threads_count-1].to = num + 1;

    for( size_t i = 0; i < threads_count; i++ )
        pthread_create( threads + i, NULL, fact_worker, tasks + i );


    uint64_t result = 0;
    for( size_t i = 0; i < threads_count; i++ ) {
        pthread_join( threads[i], NULL );
        result += tasks[i].result;
    }

    free( tasks );
    free( threads );
    return result;
}

int main( void ) {
    uint64_t input = 2000000000;
    printf( "Factors of %" PRIu64 ": %" PRIu64 "\n", input, factors_mp( input, THREADS ) );
    return 0;
}

Suppose we are using t threads. To count the factors of n, we split the range from 1 to n into t equal parts. We count the factors in each of these ranges and then sum the results. We create a type to store the information about a single task, called struct fact_task. It includes the number itself, the range boundaries from and to, and a slot for the result, which will be the count of factors of num between from and to. All workers that count factors are implemented identically, as the fact_worker routine, which accepts a pointer to a struct fact_task, counts the factors, and fills in the result field. The code that launches the threads and collects the results is contained in the factors_mp function, which, for a given number of threads,
• Allocates task descriptions and thread instances;
• Initializes the task descriptions;
• Starts all threads;
• Waits for each thread to finish its execution using a join, adding its result to the common accumulator result; and
• Frees all allocated memory.
Thus we have put the thread management into a black box, which lets us reap the benefits of multithreading. This code can be compiled with the following command:

> gcc -O3 -std=c99 -pthread -o fact_mp dist_fact_mp.c

Multiple threads do reduce the overall execution time on a multicore system for this CPU-bound task. To measure the running time, we will again use the time utility (the program, not the shell built-in). To make sure the program is used instead of the shell built-in, we prefix it with a backslash.
> gcc -O3 -o sp -std=c99 dist_fact_sp.c && \time ./sp
Factors of 2000000000: 110
21.78user 0.03system 0:21.83elapsed 99%CPU (0avgtext+0avgdata 524maxresident)k
0inputs+0outputs (0major+207minor)pagefaults 0swaps


> gcc -O3 -pthread -o mp -std=c99 dist_fact_mp.c && \time ./mp
Factors of 2000000000: 110
25.48user 0.01system 0:06.58elapsed 387%CPU (0avgtext+0avgdata 656maxresident)k
0inputs+0outputs (0major+250minor)pagefaults 0swaps

The multithreaded program took 6.5 seconds to run, while the single-threaded version took almost 22 seconds. That is a huge improvement. To talk about performance, let's introduce the notion of speedup. Speedup is the improvement in execution speed of a program run on two similar architectures with different resources. By introducing more threads, we provide more resources for a possible improvement. Obviously, for the first example we chose a task that is easy and efficient to solve in parallel. The speedup will not always be as substantial, if there is any at all; still, as we can see, the code remains quite compact (it could be even shorter if we did not take extensibility into account, for example, by fixing the number of threads instead of taking it as a parameter).

■■Question 359  Experiment with the number of threads and find the optimal one in your own environment. ■■Question 360  Read about the functions pthread_self and pthread_equal. Why can't we compare threads using the simple equality operator ==?

17.8.5 Mutexes
Although thread joining is an accessible technique, it provides no means of controlling thread execution "on the fly." Sometimes we want to ensure that actions performed by one thread are not performed before some other actions are performed by other threads. Otherwise we get a system that does not always work correctly: its output depends on the actual order in which the instructions of different threads happen to execute. Such situations occur when working with mutable data shared between threads. They are called data races, because the threads compete for the resource and any of them can win and reach it first. To avoid such situations there exist several mechanisms, and we will start with mutexes.

A mutex (short for "mutual exclusion") is an object that can be in one of two states: locked and unlocked. We work with it using two operations.
• Lock. Changes the state from unlocked to locked. If the mutex is already locked, the thread attempting the lock waits until other threads unlock it.
• Unlock. If the mutex is locked, it becomes unlocked.
Mutexes are often used to provide exclusive access to a shared resource (such as shared data). A thread that wants to work with the resource locks a mutex dedicated to controlling access to that resource. Having finished working with the resource, the thread unlocks the mutex. Mutex locking and unlocking act as compiler and full hardware memory barriers, so no reads or writes can be reordered before the lock or after the unlock. Listing 17-12 shows an example of a program in need of a mutex.


Listing 17-12. mutex_ex_counter_bad.c

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

pthread_t t1, t2;
uint64_t value = 0;

void* impl1( void* _ ) {
    for ( int n = 0; n < 10000000; n++ ) {
        value += 1;
    }
    return NULL;
}

int main( void ) {
    pthread_create( &t1, NULL, impl1, NULL );
    pthread_create( &t2, NULL, impl1, NULL );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    printf( "%" PRIu64 "\n", value );
    return 0;
}

This program has two threads, both implemented by the impl1 function. Each thread increments the shared variable value 10,000,000 times. The program must be compiled with optimizations disabled, to prevent the increment loop from being folded into a single statement value += 10000000 (alternatively, we could make value volatile).

> gcc -O0 -pthread mutex_ex_counter_bad.c

However, the resulting output is not 20000000, as we might have expected, and it differs each time we run the executable:

> ./a.out
11297520
> ./a.out
10649679
> ./a.out
13765500

The problem is that incrementing a variable is not an atomic operation from the C point of view. The generated assembly code uses several instructions: one reads the value, another adds one, and yet another writes it back. This allows the scheduler to hand the CPU over to another thread "in the middle" of an unfinished increment operation. Optimized code may or may not behave the same way.


To avoid this erratic behavior, let's use a mutex to grant one thread the privilege of being the only one working with value. This way we enforce the correct behavior. Listing 17-13 shows the modified program.

Listing 17-13. mutex_ex_counter_good.c

#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

pthread_mutex_t m;
pthread_t t1, t2;
uint64_t value = 0;

void* impl1( void* _ ) {
    for ( int n = 0; n < 10000000; n++ ) {
        pthread_mutex_lock( &m );
        value += 1;
        pthread_mutex_unlock( &m );
    }
    return NULL;
}

int main( void ) {
    pthread_mutex_init( &m, NULL );
    pthread_create( &t1, NULL, impl1, NULL );
    pthread_create( &t2, NULL, impl1, NULL );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    printf( "%" PRIu64 "\n", value );
    pthread_mutex_destroy( &m );
    return 0;
}

Its output is now stable (although the computation takes longer):

> ./a.out
20000000

The programmer associates the mutex m with the shared variable value. No modification of value should be performed outside the code section between the lock and the unlock. As long as this constraint holds, there is no way for another thread to change value once the lock is taken. The lock also acts as a memory barrier: because of it, value is read anew after the lock is acquired rather than taken from a register where it might have been cached. There is no need to make value volatile: that would only suppress optimizations, and the program is correct anyway.


Before a mutex can be used, it has to be initialized with pthread_mutex_init, as seen in the main function. It accepts attributes, just like pthread_create; these can be used to create a recursive mutex, to control deadlock detection and robustness (what happens if the thread owning the mutex dies?), and much more. To get rid of a mutex, the pthread_mutex_destroy call is used.

■■Question 361  What is a recursive mutex? How is it different from an ordinary one?

17.8.6 Deadlocks
A single mutex is rarely a source of problems. However, when several mutexes are locked at once, all sorts of bizarre situations can occur. Consider the example shown in Listing 17-14.

Listing 17-14. deadlock_ex

mutex A, B;

thread1() {
    lock(A);
    lock(B);
    unlock(B);
    unlock(A);
}

thread2() {
    lock(B);
    lock(A);
    unlock(A);
    unlock(B);
}

This pseudocode demonstrates a situation in which both threads can hang forever. Imagine that, by bad luck, the following sequence of actions has occurred:
• Thread 1 locked A; control was transferred to thread 2.
• Thread 2 locked B; control was transferred to thread 1.
After that, the threads will try to do the following:
• Thread 1 will try to lock B, but B is already locked by thread 2.
• Thread 2 will try to lock A, but A is already locked by thread 1.
Both threads are now stuck in this state forever. A situation in which threads are stuck in the locked state waiting for each other to unlock is called a deadlock. The cause of this deadlock is the different order in which the two threads take the locks. This brings us to a simple rule that will save us most of the time when we need to lock several mutexes at once.


■■Deadlock prevention  Order all the mutexes in your program in an imaginary sequence, and only lock mutexes in the order in which they appear in that sequence.

For example, suppose we have mutexes A, B, C, and D. We impose a natural order on them: A < B < C < D. If you need to lock D and B, you should always lock them in this same order: B first, then D. As long as this invariant holds, no two threads can lock a pair of mutexes in different orders.

■■Question 362  What are Coffman's conditions? How can they be used to diagnose deadlocks? ■■Question 363 How do we use Helgrind to detect deadlocks?

17.8.7 Livelocks
A livelock is a situation in which two threads are stuck, but not in the blocked state of waiting for a mutex to be unlocked: their states keep changing, yet they make no real progress. For example, pthreads does not let you check whether a mutex is locked. It would be pointless to provide such information, because by the time you received it, the state could already have been changed by another thread.

if ( the mutex is not locked ) {
    /* We still do not know whether the mutex is locked or not.
       It may already have been locked and unlocked several times
       by another thread. */
}

What is provided is pthread_mutex_trylock, which either locks a mutex or returns an error if it is already locked. Unlike pthread_mutex_lock, it does not block the current thread waiting for the unlock. Using pthread_mutex_trylock can lead to livelock situations. Listing 17-15 shows a simple example in pseudocode.

Listing 17-15. livelock_ex

mutex m1, m2;

thread1() {
    lock( m1 );
    while ( trylock( m2 ) indicates LOCKED ) {
        unlock( m1 );
        wait for a while;
        lock( m1 );
    }
    // now we are fine, because both locks are taken
}


thread2() {
    lock( m2 );
    while ( trylock( m1 ) indicates LOCKED ) {
        unlock( m2 );
        wait for a while;
        lock( m2 );
    }
    // now we are fine, because both locks are taken
}

Each thread here tries to work around the "locks should always be taken in the same order" principle. Both want to lock the two mutexes m1 and m2. The first thread works as follows:
• Lock the mutex m1.
• Try to lock the mutex m2. On failure, unlock m1, wait, and lock m1 again.
The pause is intended to give the other thread time to lock m1 and m2 and do whatever it wants with them. However, we can get stuck in a loop:
1. Thread 1 locks m1; thread 2 locks m2.
2. Thread 1 sees that m2 is locked and releases m1 for a while.
3. Thread 2 sees that m1 is locked and releases m2 for a while.
4. Back to step one.
This loop can go on forever, or it can end after causing a significant delay; it depends entirely on the OS scheduler. So, the problem with this code is that there exist execution traces in which the threads never make progress.

17.8.8 Condition Variables
Condition variables are used in conjunction with mutexes. One can think of them as wires transmitting an impulse to wake up a sleeping thread waiting for some condition to become true. Mutexes implement synchronization by controlling a thread's access to a resource; condition variables, on the other hand, allow threads to synchronize based on additional rules. For example, in the case of shared data, the actual value of the data can be part of such a rule.

Working with condition variables revolves around three new entities:
• The condition variable itself, of type pthread_cond_t.
• A function to send a wake-up signal through a condition variable: pthread_cond_signal.
• A function to wait until a wake-up signal arrives through a condition variable: pthread_cond_wait.
These two functions should only be used between the locking and unlocking of the same mutex. Calling pthread_cond_signal before any thread has called pthread_cond_wait is wasted effort: the signal is simply lost. Let's study the minimal working example shown in Listing 17-16.


Listing 17-16. condvar_mwe.c

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

pthread_cond_t condvar = PTHREAD_COND_INITIALIZER;
pthread_mutex_t m;
bool sent = false;

void* t1_impl( void* _ ) {
    pthread_mutex_lock( &m );
    puts( "Thread2 before wait" );
    while ( !sent ) pthread_cond_wait( &condvar, &m );
    puts( "Thread2 after wait" );
    pthread_mutex_unlock( &m );
    return NULL;
}

void* t2_impl( void* _ ) {
    pthread_mutex_lock( &m );
    puts( "Thread1 before signal" );
    sent = true;
    pthread_cond_signal( &condvar );
    puts( "Thread1 after signal" );
    pthread_mutex_unlock( &m );
    return NULL;
}

int main( void ) {
    pthread_t t1, t2;
    pthread_mutex_init( &m, NULL );
    pthread_create( &t1, NULL, t1_impl, NULL );
    sleep( 2 );
    pthread_create( &t2, NULL, t2_impl, NULL );
    pthread_join( t1, NULL );
    pthread_join( t2, NULL );
    pthread_mutex_destroy( &m );
    return 0;
}


Running this code produces the following output:

> ./a.out
Thread2 before wait
Thread1 before signal
Thread1 after signal
Thread2 after wait

A condition variable is initialized either by assigning it the special preprocessor constant PTHREAD_COND_INITIALIZER or by calling pthread_cond_init. The latter can accept a pointer to attributes of type pthread_condattr_t, similarly to pthread_create or pthread_mutex_init. In this example, two threads are created: t1, which executes the instructions of t1_impl, and t2, which executes those of t2_impl. The first thread locks the mutex m. It then waits for a signal that may be transmitted through the condition variable condvar. Note that pthread_cond_wait also accepts a pointer to the currently locked mutex. Now t1 sleeps, waiting for the signal to arrive; the mutex m is immediately unlocked! When the thread receives the signal, it automatically locks the mutex again and continues execution from the statement following the pthread_cond_wait call. The other thread locks the same mutex m and emits a signal through condvar. pthread_cond_signal unblocks at least one of the threads blocked on the condition variable. The pthread_cond_broadcast function would unblock all threads waiting on that condition variable, making them compete for the associated mutex as if each of them had called pthread_mutex_lock. The order in which they then get access to the CPU is up to the scheduler, not the programmer.

As we can see, condition variables let us block until a signal is received. The alternative would be a "busy wait," in which the value of a variable is checked constantly (killing performance and burning power needlessly):

while ( somecondition == false );

Of course, we could put the thread to sleep between checks, but then we either wake up too rarely to react to the event in time, or wake up too often:

while ( somecondition == false )
    sleep( 1 ); /* or anything else that makes us sleep for less time */

Condition variables allow us to wait exactly as long as needed, spending the waiting time with the thread blocked.
One important point needs to be explained. Why do we introduce the shared variable sent? Why do we use it together with the condition variable, waiting inside the while (!sent) loop? The main reason is that the implementation is allowed to issue spurious wakeups to a waiting thread. This means that a thread can wake up from its wait not only after receiving a signal, but at any time. Since the variable sent is set only before the signal is actually sent, a spuriously awoken thread will check its value and, finding it still equal to false, will call pthread_cond_wait again.

17.8.9 Spinlocks
A mutex is a safe way to synchronize. An attempt to lock a mutex already taken by another thread puts the current thread into a sleeping state. Putting a thread to sleep and waking it up has its costs, notably for the context switches, but if the wait is long, these costs are justified: we spend a little time falling asleep and waking up, but during a long sleep the thread does not consume CPU.


What is the alternative? Busy waiting, described by the following simple pseudocode:

while ( locked == true ) {
    /* do nothing */
}
locked = true;

The variable locked is a flag showing whether some thread has taken the lock. If another thread has taken it, the current thread constantly polls its value until it changes; otherwise it proceeds to take the lock itself. This burns CPU time (and increases power consumption), which is bad, but it can improve performance when the expected wait is very short. This mechanism is called a spinlock.

Spinlocks only make sense on multicore and multiprocessor systems. Using a spinlock on a single core is useless. Imagine that a thread enters the loop inside the spinlock. It waits for another thread to change the value of locked, but no other thread is running at this very moment, because the single core is busy spinning. Eventually the scheduler will put the current thread to sleep and let other threads run, but this only means we have wasted CPU cycles running an empty loop for no reason! In that case, going to sleep right away is always better, and the spinlock is of no use. The same scenario can of course occur on a multicore system too, but there is (usually) a good chance that the other thread will release the spinlock before the time slice of the current thread expires. In general, using spinlocks may or may not be beneficial; it depends on the system configuration, the program logic, and the workload. When in doubt, experiment, and prefer mutexes (which are usually implemented by spinning for several iterations first and going to sleep if the unlock has not happened by then).

Implementing a fast and correct spinlock in practice is not so trivial. Questions remain to be answered, such as the following:
• Do we need a memory barrier when locking and/or unlocking? If so, which one? Intel 64, for example, has lfence, sfence, and mfence.
• How do we ensure that the flag modification is atomic? On Intel 64, for example, an xchg instruction (with a lock prefix in the case of multiple processors) suffices. pthreads provides us with a carefully designed and portable spinlock mechanism. For more information, consult the man pages for the following functions: • pthread_spin_init • pthread_spin_lock • pthread_spin_trylock • pthread_spin_unlock • pthread_spin_destroy

17.9 Semaphores
A semaphore is a shared integer variable on which three actions can be performed: • Initialization with an argument N. Sets its value to N. • Wait (enter). If the value is not zero, decrement it. Otherwise, wait until someone else increments it and then proceed with the decrement. • Post (leave). Increment its value. Obviously, the value of this variable, which is not directly accessible, cannot fall below 0.


Semaphores are not part of the pthreads specification; we are working with semaphores whose interface is described in the POSIX standard. However, code that uses semaphores must still be compiled with the -pthread flag. Most Unix-like operating systems implement both the standard pthreads and semaphore features. Semaphores are quite commonly used to perform synchronization between threads. Listing 17-17 shows an example of using semaphores.

Listing 17-17. semaphore_mwe.c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>

sem_t sem;
uint64_t counter1 = 0;
uint64_t counter2 = 0;
pthread_t t1, t2, t3;

void* t1_impl( void* _ ) {
    while( counter1 < 10000000 ) counter1++;
    sem_post( &sem );
    return NULL;
}
void* t2_impl( void* _ ) {
    while( counter2 < 20000000 ) counter2++;
    sem_post( &sem );
    return NULL;
}
void* t3_impl( void* _ ) {
    sem_wait( &sem );
    sem_wait( &sem );
    printf("Done: counter1 = %" PRIu64 " counter2 = %" PRIu64 "\n",
           counter1, counter2);
    return NULL;
}
int main(void) {
    sem_init( &sem, 0, 0 );
    pthread_create( &t3, NULL, t3_impl, NULL );
    sleep(1);
    pthread_create( &t1, NULL, t1_impl, NULL );
    pthread_create( &t2, NULL, t2_impl, NULL );


    sem_destroy(&sem);
    pthread_exit(NULL);
    return 0;
}

The sem_init function initializes the semaphore. Its second argument is a flag: 0 corresponds to a process-local semaphore (which can be used by different threads of the same process), while a non-zero value defines a semaphore visible to multiple processes.³ The third argument defines the initial value of the semaphore. A semaphore is destroyed using the sem_destroy function. In the example, two counters and three threads are created. Threads t1 and t2 increment their respective counters up to 10000000 and 20000000 and then increment the semaphore value by calling sem_post. t3 starts by blocking on two decrements of the semaphore value. Once the semaphore has been incremented twice by the other threads, t3 prints the counters to stdout. The pthread_exit call ensures that the main thread will not terminate before all other threads have finished their work. Semaphores are useful for tasks such as • Forbidding more than n threads to execute a section of code simultaneously. • Making one thread wait for another to complete a specific action, thus imposing an order on their actions. • Maintaining no more than a fixed number of worker threads executing a given task in parallel (more threads than necessary can decrease performance). A two-state semaphore is not completely analogous to a mutex: unlike a mutex, which can only be unlocked by the thread that locked it, any thread can change a semaphore freely. We will see another example of semaphore usage in Listing 17-18, where semaphores make two threads start each loop iteration simultaneously (and after the loop body executes, each thread waits for the other to finish the iteration). Manipulations with semaphores act as both compiler and hardware memory barriers. For more information about semaphores, see the man pages for the following functions: • sem_close • sem_destroy • sem_getvalue • sem_init • sem_open • sem_post • sem_unlink • sem_wait

■■Question 364  What is a named semaphore? Why should we unlink it even after the process has finished?

³ In that case, the semaphore itself will be placed on a shared page, which will not be physically duplicated after the fork() system call is made.


17.10 How Strong Is Intel 64?
Abstract machines with a relaxed memory model can be difficult to reason about: reordered writes, speculative reads, and values that seem to arrive from the future are confusing. Intel 64 is generally considered strong. In most cases, it guarantees that certain constraints are satisfied, including but not limited to the following: • Stores are not reordered with older stores. • Stores are not reordered with older loads. • Loads are not reordered with other loads. • In a multiprocessor system, stores to the same location have a total order. There are also exceptions, such as the following: • Stores that bypass the cache, performed with instructions such as movntdq, can be reordered with other stores. • String instructions such as rep movs can be reordered with other stores. You can find a complete list of guarantees in volume 3, section 8.2.2 of [15]. However, according to [15], "reads may be reordered with older writes to different locations but not with older writes to the same location." So make no mistake: memory reorderings happen. The simple program shown in Listing 17-18 demonstrates a memory reordering performed by hardware. It implements the example already shown in Listing 17-4: there are two threads and two shared variables x and y. The first thread stores to x and loads y; the second thread stores to y and loads x. The compiler barrier ensures that these two statements are translated to assembly in the same order. As the guarantee just quoted suggests, stores and loads to different locations can be reordered; therefore, we cannot rule out hardware memory reordering here, because x and y are different locations!

Listing 17-18. reordering_cpu_mwe.c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <unistd.h>

sem_t sem_begin0, sem_begin1, sem_end;
int x, y, read0, read1;

void *thread0_impl( void *param ) {
    for (;;) {
        sem_wait( &sem_begin0 );
        x = 1;
        // This only disables compiler reorderings:
        asm volatile("" ::: "memory");


        // The following line would also disable hardware reorderings:
        // asm volatile("mfence" ::: "memory");
        read0 = y;
        sem_post( &sem_end );
    }
    return NULL;
}
void *thread1_impl( void *param ) {
    for (;;) {
        sem_wait( &sem_begin1 );
        y = 1;
        // This only disables compiler reorderings:
        asm volatile("" ::: "memory");
        // The following line would also disable hardware reorderings:
        // asm volatile("mfence" ::: "memory");
        read1 = x;
        sem_post( &sem_end );
    }
    return NULL;
}
int main( void ) {
    sem_init( &sem_begin0, 0, 0);
    sem_init( &sem_begin1, 0, 0);
    sem_init( &sem_end, 0, 0);
    pthread_t thread0, thread1;
    pthread_create(&thread0, NULL, thread0_impl, NULL);
    pthread_create(&thread1, NULL, thread1_impl, NULL);
    for (uint64_t i = 0; i < 100000; i++) {
        x = 0; y = 0;
        sem_post( &sem_begin0 );
        sem_post( &sem_begin1 );
        sem_wait( &sem_end );
        sem_wait( &sem_end );
        if (read0 == 0 && read1 == 0) {
            printf( "reordering happened on iteration %" PRIu64 "\n", i );
            exit(0);
        }
    }


    puts("No reordering detected during 100000 iterations");
    return 0;
}

To observe the reordering, we performed several experiments. The main function works as follows: 1. Initialize the threads, two begin semaphores, and one end semaphore. 2. Set x = 0 and y = 0. 3. Notify the threads that they should start executing a transaction. 4. Wait until both threads complete the transaction. 5. Check whether a memory reordering occurred. It is observed when both the load of x and the load of y returned zeros (because they were reordered before the stores). 6. If a memory reordering is detected, report it and terminate the process. Otherwise, try again from step (2), up to a maximum of 100,000 attempts. Each thread waits for a start signal from main, executes the transaction, and notifies main about it; then it starts all over again. After launching, you will see that 100,000 iterations are enough to observe a memory reordering.

> gcc -pthread -O2 -o reordering reordering_cpu_mwe.c
> ./reordering
reordering happened on iteration 128
> ./reordering
reordering happened on iteration 12
> ./reordering
reordering happened on iteration 171
> ./reordering
reordering happened on iteration 80
> ./reordering
reordering happened on iteration 848
> ./reordering
reordering happened on iteration 366
> ./reordering
reordering happened on iteration 273
> ./reordering
reordering happened on iteration 105
> ./reordering
reordering happened on iteration 14
> ./reordering
reordering happened on iteration 5
> ./reordering
reordering happened on iteration 414

It might look like magic, but this is the lowest level of the machine showing through, and it introduces rarely noticed (but persistent) bugs into software. Such bugs in multithreaded software are very difficult to detect. Imagine a bug that shows up only after four months of non-stop execution, corrupting the heap and crashing the program 42 allocations after the corruption happened! This is why writing high-performance lock-free multithreaded software requires a lot of experience.


So what we need to do is add the mfence instruction. Replacing the compiler fence with a full memory fence, asm volatile( "mfence" ::: "memory" );, fixes the problem: no reordering will be detected, no matter how many iterations we try.

17.11 What Is Lock-Free Programming?
We have seen how to ensure consistency of operations when working in a multithreaded environment. Whenever we need to perform a complex operation on shared data or resources without interference from other threads, we lock a mutex associated with that resource or block of memory. Code is said to be lock-free if both of the following constraints are satisfied: • No mutexes are used. • The system cannot be locked up indefinitely. This includes livelocks. In other words, lock-free programming is a family of techniques that ensure safe manipulation of shared data without using mutexes. We almost always expect only a part of the program code to satisfy the lock-free property. For example, a data structure such as a queue can be considered lock-free if the functions used to manipulate it are lock-free. So locking elsewhere is still possible, but as long as we are calling functions such as enqueue or dequeue, progress will be made. From a programmer's point of view, lock-free programming differs from traditional programming with mutexes in that it presents two challenges that mutexes normally cover: 1. Reorderings. While mutex manipulations include hardware and compiler memory barriers, without them you must be explicit about where to place memory barriers. You do not want to put them after every statement, because that hurts performance. 2. Non-atomic operations. Operations between locking and unlocking a mutex are, in a sense, atomic: no other thread can modify the data associated with the mutex (unless there are unsafe data manipulations outside the lock/unlock section). Without this mechanism, we are left with very few atomic operations, which we will study later in this chapter. On most modern processors, reads and writes of naturally aligned native types are atomic. Natural alignment means aligning the variable on a boundary that corresponds to its size. 
On Intel 64, there is no guarantee that reads and writes larger than 8 bytes are atomic. Other memory interactions are generally non-atomic. This includes, but is not limited to: • 16-byte reads and writes performed by Streaming SIMD Extensions (SSE) instructions. • String operations (the movsb instruction and the like). • Many operations that are atomic on a single-processor system but not on a multiprocessor system (for example, the inc instruction). Making them atomic requires a special lock prefix, which prevents other processors from performing their own read-modify-write sequence between the stages of these instructions. The inc instruction, for example, needs to read a value from memory and write back its incremented value. Without a lock prefix, another processor can intervene in between, which can lead to a lost update.


Here are some examples of non-atomic operations:

char buf[1024];
uint64_t* data = (uint64_t*)(buf + 1);
/* non-atomic: unnatural alignment */
*data = 0;

/* non-atomic: the increment requires a read and a write */
++global_aligned_var;

/* atomic write */
global_aligned_var = 0;

void f(void) {
    /* atomic read */
    int64_t local_variable = global_aligned_var;
}

These cases are architecture specific. We also want to perform more complex operations atomically (for example, incrementing a counter). To make them safe without using mutexes, engineers invented interesting primitive operations such as compare-and-swap (CAS). Once this operation is implemented as a machine instruction on a specific architecture, it can be used in combination with the more trivial atomic reads and writes to implement many lock-free algorithms and data structures. The CAS operation acts as an atomic version of the sequence of operations described by the following equivalent C function:

bool cas(int* p, int old, int new) {
    if (*p != old) return false;
    *p = new;
    return true;
}

A shared counter, whose update requires reading and writing back a modified value, is a typical case where we need a CAS instruction to perform an atomic increment or decrement. Listing 17-19 shows a function that does this.

Listing 17-19. cas_counter.c
int add(int* p, int amount) {
    bool done = false;
    int value;
    while (!done) {
        value = *p;
        done = cas(p, value, value + amount);
    }
    return value + amount;
}


This example shows a typical pattern seen in many CAS-based algorithms: read a memory location, compute its modified value, and repeatedly attempt to swap in the new value as long as the current memory value equals the old one. The swap fails if the memory location has been modified by another thread; then the whole read-modify-write cycle is repeated. Intel 64 implements the CAS instructions cmpxchg, cmpxchg8b, and cmpxchg16b. In the multiprocessor case, they also require the lock prefix. The cmpxchg instruction is of particular interest. It accepts two operands: a register or a memory location, and a register. It compares rax⁴ with the first operand. If they are equal, the zf flag is set and the value of the second operand is loaded into the first. Otherwise, the actual value of the first operand is loaded into rax and zf is cleared. These instructions can be used as part of a mutex or semaphore implementation. As we will see in Section 17.12.2, there is now a standard-compliant way to use compare-and-set operations (as well as to manipulate atomic variables). We recommend sticking to it to avoid non-portable code and using atomics whenever you can. When you need complex operations to be performed atomically, use mutexes or stick with specialized lock-free data structure implementations: writing lock-free data structures has proven to be a challenge.

■■Question 365  What is the ABA problem? ■■Question 366 Read the description of cmpxchg in the Intel documents [15].

17.12 C11 Memory Model
17.12.1 Overview
Most of the time, we want to write code that is correct on every architecture. To achieve this, we rely on the memory model described in the C11 standard. The compiler can implement some operations trivially or emit special instructions to enforce certain guarantees when the actual hardware architecture is weaker. Unlike Intel 64, the C11 memory model is quite weak. It guarantees data dependency ordering, but nothing more, so in the classification mentioned in Section 17.4 it corresponds to the second kind: weak with data dependency ordering. Other hardware architectures provide similarly weak guarantees, for example, ARM. Because the C memory model is weak, to write portable code we cannot assume the program will run on a generally strong architecture such as Intel 64, for two reasons: • When recompiling for another, weaker architecture, the observed program behavior may change because of how hardware reorderings work. • When recompiling for the same architecture, compiler reorderings may occur that do not break the weak ordering rules imposed by the standard. This can alter the observed program behavior, at least for some execution traces.

17.12.2 Atomics
An important C11 feature that can be used to write fast multithreaded programs is atomics (see section 7.17 of [7]). These are special variable types that can be modified atomically. To use them, include the stdatomic.h header.

⁴ Or eax, ax, al, depending on the operand size.


Obviously, support from the architecture is needed to implement them efficiently. In the worst case, when the architecture does not support such operations, each variable of this type is paired with a mutex, which is locked on any modification or even read of the variable. Atomics allow us to perform thread-safe operations on ordinary data in some cases; it is often possible to do without the heavy machinery of mutexes. However, writing data structures such as lock-free queues is still not an easy task. For that, we strongly recommend using existing implementations as "black boxes." C11 defines a new _Atomic() type specifier. You can declare an atomic integer like this:

_Atomic(int) counter;

_Atomic transforms the name of a type into the name of an atomic type. Alternatively, you can use the atomic type names directly, like this:

atomic_int counter;

A complete correspondence between the _Atomic(T) forms and the direct atomic_T type names can be found in section 7.17.6 of [7]. Atomic local variables must not be initialized directly; instead, the ATOMIC_VAR_INIT macro should be used. This is understandable, because on some architectures with fewer hardware capabilities, each such variable must be associated with a mutex, which must also be created and initialized. Global atomic variables are guaranteed to be in a correct initial state. ATOMIC_VAR_INIT must be used in a variable declaration together with initialization; if you want to initialize the variable later, use the atomic_init macro.

void f(void) {
    /* initialization during declaration */
    atomic_int x = ATOMIC_VAR_INIT( 42 );
    atomic_int y;
    /* initialization after declaration */
    atomic_init( &y, 42 );
}

It is your responsibility to ensure that the initialization of an atomic variable completes before anything else is done with it. In other words, concurrent access to a variable being initialized is a data race. Atomic variables should only be manipulated through the interface defined in the language standard. 
It consists of various operations such as load, store, exchange, and so on. Each of them exists in two versions: • An explicit version, which takes an extra argument describing the memory order. Its name ends with _explicit. For example, the load operation is T atomic_load_explicit( _Atomic(T)* object, memory_order order ); • An implicit version, which implies the strongest (sequentially consistent) memory order. It has no _explicit suffix. For example, T atomic_load( _Atomic(T)* object );


17.12.3 Memory Orderings in C11
The memory order is described by one of these enumeration constants (in increasing order of strictness): • memory_order_relaxed implies the weakest model: any memory reordering is possible as long as it does not change the observable behavior of the single-threaded program. • memory_order_consume is a weaker version of memory_order_acquire. • memory_order_acquire means that the load operation has acquire semantics. • memory_order_release means that the store operation has release semantics. • memory_order_acq_rel combines acquire and release semantics. • memory_order_seq_cst implies that no memory reordering is performed for all operations marked with it, no matter which atomic variable is referenced. By providing an explicit memory ordering constant, we control which observable reorderings of operations we allow. This covers both compiler and hardware reorderings, so when the compiler sees that disabling compiler reorderings alone does not provide all the guarantees we need, it also emits platform-specific instructions such as sfence. The memory_order_consume option is rarely used. It is based on the notion of a "consume operation": an event that occurs when a value is read from memory and then used in several subsequent operations, creating a data dependency. On weaker architectures such as PowerPC or ARM, its use can lead to better performance, because these architectures guarantee data dependency ordering without explicit barriers, so the costly hardware memory barrier instruction can be spared. However, because this ordering is so difficult to implement efficiently and correctly in compilers, it is usually mapped directly to memory_order_acquire, which is slightly stronger. We do not recommend using it. See [30] for additional information. 
The acquire and release semantics of these memory ordering options correspond directly to the notions we discussed in Section 17.7. memory_order_seq_cst corresponds to the notion of sequential consistency, which we elaborated on in Section 17.4. Because all non-explicit operations on atomics accept it as the default memory ordering, C11 atomics are sequentially consistent by default. This is the safest route, and it is still typically faster than mutexes. Weaker orderings are harder to get right but allow better performance. The atomic_thread_fence(memory_order order) function inserts a memory fence (compiler and hardware) of a strength corresponding to the specified memory order. For example, this operation has no effect for memory_order_relaxed, but for sequentially consistent ordering on Intel 64 the mfence instruction will be emitted (along with a compiler fence).

17.12.4 Operations
The following operations can be performed on atomic variables (T denotes the non-atomic type; U refers to the type of the second argument of the arithmetic operations, which for all types except pointers is the same as T, and for pointers is ptrdiff_t):

void atomic_store(volatile _Atomic(T)* object, T value);
T atomic_load(volatile _Atomic(T)* object);
T atomic_exchange(volatile _Atomic(T)* object, T desired);


T atomic_fetch_add(volatile _Atomic(T)* object, U operand);
T atomic_fetch_sub(volatile _Atomic(T)* object, U operand);
T atomic_fetch_or (volatile _Atomic(T)* object, U operand);
T atomic_fetch_xor(volatile _Atomic(T)* object, U operand);
T atomic_fetch_and(volatile _Atomic(T)* object, U operand);

bool atomic_compare_exchange_strong( volatile _Atomic(T)* object, T* expected, T desired);
bool atomic_compare_exchange_weak( volatile _Atomic(T)* object, T* expected, T desired);

All these operations exist in versions with the _explicit suffix, which take the memory ordering as an additional argument. The load and store functions need no further explanation; we discuss the others briefly. atomic_exchange is a combination of load and store: it replaces the value of an atomic variable with the desired one and returns its previous value. The fetch_op family of operations is used to atomically change the value of an atomic variable. Imagine that you need to increment an atomic counter. Without fetch_add this is impossible, because to increment it you need to add one to its previous value, which you have to read first. The operation is performed in three steps (read, add, write), and other threads can interfere between these stages, which destroys atomicity. atomic_compare_exchange_strong is preferable to its weak counterpart, because the weak version can fail spuriously; the latter, however, performs better on some platforms. The atomic_compare_exchange_strong function is roughly equivalent to the following pseudocode:

if ( *object == *expected ) *object = desired;
else *expected = *object;

As you can see, this is the typical CAS operation that was discussed in Section 17.11. The atomic_is_lock_free macro is used to check whether a specific atomic variable uses locks or not. Remember that without an explicit memory order, all of these operations are sequentially consistent, which on Intel 64 means mfence instructions throughout the code. This can be a huge performance killer. The Boolean shared flag has a special type called atomic_flag. It has two states: set and clear. Operations on it are guaranteed to be atomic without the use of locks. 
The flag must be initialized with the ATOMIC_FLAG_INIT macro as follows:

atomic_flag is_working = ATOMIC_FLAG_INIT;

The relevant functions are atomic_flag_test_and_set and atomic_flag_clear; both have explicit counterparts accepting memory ordering descriptions.

■■Question 367  Read the man pages for atomic_flag_test_and_set and atomic_flag_clear.


17.13 Summary
In this chapter, we studied the basic concepts of multithreaded programming. We saw the different memory models and the problems that come from compiler and hardware optimizations interfering with the order of instruction execution. We learned to control them by placing different kinds of memory barriers, and we saw why volatile is not a solution to the problems that arise from multithreading. Next, we introduced pthreads, the most common standard for writing multithreaded applications on Unix-like systems. We practiced thread management, used mutexes and condition variables, and learned why spinlocks only make sense on multicore and multiprocessor systems. We saw how memory reorderings must be taken into account even when working on a generally strong architecture such as Intel 64, and we saw the limits of its strictness. Finally, we studied atomic variables, a very useful C11 feature that lets us eliminate the explicit use of mutexes and, in many cases, increase performance while maintaining correctness. Mutexes are still important when we want to perform complex manipulations on non-trivial data structures.

■■Question 368  What problems arise from using multithreading?
■■Question 369  What makes using multiple threads worthwhile?
■■Question 370  Should we use multithreading even if the program does not perform a lot of computations? If yes, provide a use case.
■■Question 371  What is compiler reordering? Why is it performed?
■■Question 372  Why does a single-threaded program have no way to observe compiler memory reorderings?
■■Question 373  Which types of memory models do you know?
■■Question 374  How do we write code that is sequentially consistent with respect to manipulating two shared variables?
■■Question 375  Are volatile variables sequentially consistent?
■■Question 376  Show an example in which memory reordering can lead to highly unexpected program behavior.
■■Question 377  What are the arguments against using volatile variables?
■■Question 378  What is a memory barrier?
■■Question 379  Which types of memory barriers do you know?
■■Question 380  What are acquire semantics?
■■Question 381  What are release semantics?
■■Question 382  What is a data dependency? Can you write code where a data dependency does not force an order on operations?
■■Question 383  What is the difference between mfence, sfence, and lfence?
■■Question 384  Why do we need instructions other than mfence?


■■Question 385  What function calls act as compiler barriers?
■■Question 386  Are calls to inline functions compiler barriers?
■■Question 387  What is a thread?
■■Question 388  What is the difference between threads and processes?
■■Question 389  What constitutes the state of a process?
■■Question 390  What constitutes the state of a thread?
■■Question 391  Why should the -pthread flag be used when compiling with pthreads?
■■Question 392  Is pthreads a static or a dynamic library?
■■Question 393  How do we know in which order the scheduler will execute the threads?
■■Question 394  Can a thread access another thread's stack?
■■Question 395  What does pthread_join do, and how can we use it?
■■Question 396  What is a mutex? Why do we need it?
■■Question 397  Must every shared constant variable be associated with a mutex?
■■Question 398  Must every shared variable that is never changed be associated with a mutex?
■■Question 399  Must every shared variable that is changed be associated with a mutex?
■■Question 400  Can we work with a shared variable without using a mutex?
■■Question 401  What is a deadlock?
■■Question 402  How do we avoid deadlocks?
■■Question 403  What is a livelock? How is it different from a deadlock?
■■Question 404  What is a spinlock? How is it different from a mutex?
■■Question 405  Should spinlocks be used on a single-core system? Why?
■■Question 406  What is a condition variable?
■■Question 407  Why do we need condition variables if we have mutexes?
■■Question 408  What guarantees does Intel 64 provide for memory reorderings?
■■Question 409  What important guarantees does Intel 64 not provide for memory reorderings?
■■Question 410  Correct the program shown in Listing 17-18 so that the memory reordering does not occur.
■■Question 411  Correct the program shown in Listing 17-18 so that the memory reordering does not occur, through the use of atomic variables.


■■Question 412  What is lock-free programming? Why is it more difficult than traditional multithreaded programming with locks?
■■Question 413  What is a CAS operation? How can it be implemented on Intel 64?
■■Question 414  How strong is the C memory model?
■■Question 415  Can the strength of the C memory model be controlled?
■■Question 416  What is an atomic variable?
■■Question 417  Can any data type be atomic?
■■Question 418  Which atomic variables must be explicitly initialized?
■■Question 419  Which memory orderings does C11 recognize?
■■Question 420  How are the _explicit-suffixed atomic variable manipulation functions different from their ordinary counterparts?
■■Question 421  How do we perform an atomic increment on an atomic variable?
■■Question 422  How do we perform an atomic XOR on an atomic variable?
■■Question 423  What is the difference between the weak and strong versions of compare_exchange?


PART IV

Appendices

CHAPTER 18

Appendix A. Using gdb
The debugger is a very powerful tool at your disposal. It allows you to execute programs step by step and monitor their state, including register values and memory contents. In this book, we use the debugger called gdb. This appendix is an introduction designed to help you get started. Debugging is a process of finding bugs and studying program behavior. To do this, we usually perform single steps, observing the part of the program state that interests us. We can also run the program until a certain condition is met or a position in the code is reached. Such a position in the code is called a breakpoint. Let us study the example program shown in Listing 18-1. We already saw it in Chapter 2. This code prints the contents of the rax register to standard output.

Listing 18-1. print_rax_2.asm
section .data
codes: db '0123456789ABCDEF'

section .text
global _start
_start:
    mov rax, 0x1122334455667788
    mov rdi, 1
    mov rdx, 1
    mov rcx, 64
.loop:
    push rax
    sub rcx, 4
    sar rax, cl
    and rax, 0xf

    lea rsi, [codes + rax]
    mov rax, 1
    push rcx
    syscall
    pop rcx

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_18


    pop rax
    test rcx, rcx
    jnz .loop

    mov rax, 60     ; invoke the 'exit' system call
    xor rdi, rdi
    syscall

Let us compile a print_rax executable from it and launch gdb.

> nasm -o print_rax.o -f elf64 print_rax.asm
> ld -o print_rax print_rax.o
> gdb print_rax
...
(gdb)

gdb has its own command system, and interacting with it means issuing these commands. Whenever gdb has started and you see its command prompt (gdb), you can type commands, and gdb will react to them. You can load an executable file by issuing the file command with the file name, or by passing the name as an argument when launching gdb.

(gdb) file print_rax
Reading symbols from print_rax...(no debugging symbols found)...done.

The gdb command prompt supports autocompletion, and many commands also have abbreviated forms. The two most important commands are: • quit, to exit gdb. • help cmd, to show help for the command cmd. The ~/.gdbinit file stores commands that will be executed automatically when gdb starts. This file can also be created in the current directory, but for security reasons this feature is disabled by default.

■■Note  To enable loading the .gdbinit file from any directory, add the following line to the ~/.gdbinit file in your home directory: set auto-load safe-path /

By default, gdb uses AT&T assembly syntax. In this book we follow Intel syntax. To change gdb's default assembly syntax, add the following line to the ~/.gdbinit file: set disassembly-flavor intel


Other useful commands include the following:

• run starts program execution.
• break x creates a breakpoint near the label x. When we launch the program with run, or continue it, we will stop at the first breakpoint hit, which lets us examine the program state.
• break *address places a breakpoint at the specified address.
• continue resumes program execution.
• stepi or si steps forward by one instruction.
• nexti or ni also executes one instruction, but does not enter functions if the instruction was a call; instead, it lets the called function terminate and breaks at the next instruction.

Let's do the following:

(gdb) break _start
Breakpoint 1 at 0x4000b0
(gdb) start
Function "main" not defined.
Make breakpoint pending on future shared library load? (y or [n]) n
Starting program: /home/stud/test/print_rax

Breakpoint 1, 0x00000000004000b0 in _start ()

We stopped at the breakpoint placed at the _start label. Let's switch to a pseudographic mode using the commands

layout asm
layout regs

The output is shown in Figure 18-1. The layout consists of three windows:

• The upper part shows the registers and their current values.
• The middle part shows the disassembled code.
• The bottom part is the interactive command prompt.

One of these windows is in focus; pressing Ctrl-X and then Ctrl-O switches the focus between them. The arrow keys scroll the window in focus up and down.

• print /FMT displays register contents or memory values. Register names are prefixed with a dollar sign, for example, $rax.
• x /FMT is another very useful command to inspect memory contents. Unlike print, it expects one more level of indirection, so it accepts a pointer.


Figure 18-1. gdb UI: layout asm + regs

FMT (used by the print and x commands) is an encoded format description. It allows us to explicitly choose the data type in order to interpret the memory contents correctly. FMT consists of a format letter and a size letter. The most useful format letters are

• x (hexadecimal)
• a (address)
• i (instruction, tries to disassemble)
• c (character)
• s (null-terminated string)

The most useful size letters are b (byte) and g (giant, 8 bytes). To take the address of a variable, use the ampersand &. The following examples, based on the program shown in Listing 18-1, show when all this is useful.

• Display the contents of rax:
(gdb) print $rax
$1 = 1234605616436508552

• Display the first character of codes:
(gdb) print /c codes
$2 = 48 '0'


• Disassemble an instruction at the address _start:
(gdb) x /i &_start
   0x4000b0 <_start>:  movabs rax,0x1122334455667788

• Disassemble the current instruction:
(gdb) x /i $rip
=> 0x4000e9:  jne 0x4000c9

• Examine the contents of codes. The /FMT part of the x command can start with an element count; in our case, /12cb means "12 characters of one byte each."
(gdb) x /12cb &codes
0x6000f8:  48 '0'  49 '1'  50 '2'  51 '3'  52 '4'  53 '5'  54 '6'  55 '7'
0x600100:  56 '8'  57 '9'  65 'A'  66 'B'

• Examine the first 8 bytes on top of the stack:
(gdb) x /x $rsp
0x7fffffffdf90: 0x01

• Examine the second quad word stored on the stack:
(gdb) x /x $rsp + 8
0x7fffffffdf98: 0xc1

■■Question 424  Study the output of the help x command.

To use gdb productively with C programs, remember to always use the -ggdb compiler option. It generates additional information that gdb can make use of, such as the .line section or symbols for local variables. An appropriate layout for working with C code is src; type layout src to switch to it. Figure 18-2 shows this layout.


Figure 18-2. gdb user interface: layout src

Another useful skill is studying and navigating the call stack. Whenever a function is called, it uses a part of the stack to store its local variables. To demonstrate the navigation, we will use a simple program shown in Listing 18-2.

Listing 18-2. call_stack.c

#include <stdio.h>

void g(int garg) {
    int glocal = 99;
    puts("Inside g");
}

void f(int farg) {
    int flocal = 44;
    g(flocal);
}

int main(void) {
    f(42);
    return 0;
}


Let's compile the program and launch gdb on it as follows:

> gcc -ggdb call_stack.c -o call_stack
> gdb call_stack

Then we place a breakpoint at the g function and run the program as follows:

(gdb) break g
Breakpoint 1 at 0x400531: file call_stack.c, line 5.
(gdb) run
Starting program: .../call_stack

Breakpoint 1, g (garg=44) at call_stack.c:5
5           puts("Inside g");

We are free to issue layout src if we want to. The program has been launched and stopped at line 5, where the function g starts. We can explore local variables or arguments using the print command; gdb will infer their types correctly most of the time.

(gdb) print garg
$1 = 44

We want to see which functions are currently being executed. The backtrace command is the way to do it.

(gdb) backtrace
#0  g (garg=44) at call_stack.c:5
#1  0x0000000000400561 in f (farg=42) at call_stack.c:10
#2  0x0000000000400572 in main () at call_stack.c:14

There are three stack frames that gdb is aware of, and we can switch between them using the frame command. Our state at this point is depicted in Figure 18-3. We are sure that the function f has called the function g, as the backtrace shows, so the activation of f must hold the values of its argument farg and its local variable flocal, and we want to know them. If we try to print farg right away, gdb complains that the variable does not exist, because it does not occur in the current (topmost) stack frame. However, if we first select the appropriate stack frame using the frame 1 command, we get access to all of its local variables. Figure 18-4 shows this change.

(gdb) print farg
No symbol "farg" in current context.


Figure 18-3. Inside the function g

(gdb) frame 1
#1  0x0000000000400561 in f (farg=42) at call_stack.c:10
(gdb) print farg
$3 = 42

■■Question 425  What does the info locals command do?

Additionally, gdb supports evaluating expressions with common arithmetic operations, calling functions, writing automation scripts in Python, and much more. To read further, refer to [1].


Figure 18-4. Inside the function f


CHAPTER 19

Appendix B. Using make

This appendix will introduce you to the basics of writing Makefiles. For more information, refer to [2].

To build a program, you may need to perform several actions: launch the compiler with the right flags (probably once for each source file) and use the linker. Sometimes you must also run scripts written to generate source code files. Sometimes the program consists of several parts written in different programming languages! Moreover, if you have changed only a part of the program, you might not want to rebuild everything, only the parts that depend on the changed source files. Huge programs can take hours of CPU (central processing unit) time to build!

In this book we use GNU Make. It is a common tool used to control the generation of artifacts such as executable files, dynamic libraries, resource files, and so on.

19.1 Simple Makefile

When writing a program, you usually create a special makefile for it, so that you can build it with make. This text file describes the source files and the dependencies between them in a declarative way. Make then chooses the right order in which to process the files, so that by the time each file is processed, its dependencies have already been processed.

To start the build process, launch make in the directory where the Makefile is located. This is usually the root directory of your project. You can explicitly select another makefile by supplying the -f flag, for example, make -f Makefile_other.

A basic Makefile is composed of blocks, each called a rule, of the following form:

<target>: <prerequisites>
<tab> <recipe>

A rule describes how to generate a specific file, which is the target. The prerequisites describe which other targets should be generated first. The recipe consists of one or more actions to be executed by make. Each recipe line must be preceded by a tab character!

Suppose we have a simple program consisting of two assembly files: main.asm and lib.asm. We want to produce an object file from each of them and then link the two into an executable. Listing 19-1 shows an example of a simple Makefile.


Chapter 19 ■ Appendix B. Using Make

Listing 19-1. Makefile_simple

program: main.o lib.o
	ld -o program main.o lib.o

lib.o: lib.asm
	nasm -f elf64 -o lib.o lib.asm

main.o: main.asm
	nasm -f elf64 -o main.o main.asm

clean:
	rm main.o lib.o program

When a Makefile with this content is created, launching make in the same directory starts the recipe for the first target described. If a target named all is present, its recipe will be launched. Otherwise, typing make targetname launches the recipe for targetname.

The target program should produce the file program. To do so, we must first build the files main.o and lib.o. If we then change main.asm and relaunch make, only main.o will be rebuilt before program is updated, but not lib.o. The same mechanism forces a rebuild of lib.o whenever lib.asm is changed. So, a recipe is launched when there is no file matching the target name or when this file is outdated (because one of its dependencies has been updated).

Traditionally, every Makefile has a target called clean to get rid of all generated files, leaving only the sources. Targets such as clean are called phony targets because they do not correspond to a specific file. It is better to list them as prerequisites of a special .PHONY target, as follows:

clean:
	rm -f *.o
help:
	echo 'This is help'
.PHONY: clean help

19.2 Introducing Variables

It is not very convenient to duplicate a lot of text in Makefiles. Once there are many source files that are compiled in the same way, we get tired of copying the same compilation options over and over. Variables solve this problem. They are declared as follows:

variable = value

They are not the same as environment variables such as PWD. Their values are substituted by using a dollar sign and a pair of parentheses, as follows: $(variable)


Now, let's use variables at least in the following cases:

• To abstract the compiler (we can then easily switch between Clang, GCC, MSVC, or any other compiler, as long as they support the same set of flags).
• To abstract the compilation flags.

Traditionally, in the case of C, these variables are called

• CC for "C compiler."
• CFLAGS for "C compiler flags."
• LD for "link editor" (linker).
• AS for "assembly language compiler" (assembler).
• ASFLAGS for "assembly language compiler flags."

An added benefit is that whenever we want to change the build flags, we only have to do it in one place. Listing 19-2 shows the modified Makefile.

Listing 19-2. Makefile_vars

AS = nasm
LD = ld
ASFLAGS = -f elf64

program: main.o lib.o
	$(LD) -o program main.o lib.o

lib.o: lib.asm
	$(AS) $(ASFLAGS) -o lib.o lib.asm

main.o: main.asm
	$(AS) $(ASFLAGS) -o main.o main.asm

clean:
	rm main.o lib.o program

.PHONY: clean

A variable can be left empty; it will then expand to an empty string:

EMPTYVAR =

A variable can include the values of other variables:

INCLUDEDIR = include
CFLAGS = -c -std=c99 -I$(INCLUDEDIR) -ggdb -Wno-attributes


Target names support the % wildcard symbol. There can be only one wildcard in a target name. The substring matched by % is called the stem. Occurrences of % in the prerequisites are replaced with exactly the stem. For example, this rule

%.o : %.c
	echo "Creating an object file"

specifies how to create an arbitrary object file from a .c file with the same name. However, we cannot do much with such rules yet, because as soon as we try to write a command to compile the file, we run into a problem: we do not know the exact names of the files involved, and the stem is not accessible inside the recipe. Automatic variables solve this problem.

19.3 Automatic Variables

Automatic variables are a special make feature. They are recomputed for each rule launched, and their values depend on the target and its prerequisites. They can only be used inside the recipe itself, not in the prerequisites or in the target.

Imagine that you want to compile every .c file into a .o file with the same flags. Should we really duplicate the rules for each file? No — we can use wildcards together with automatic variables. There are many automatic variables, but the most used ones are

• $*  The stem.
• $@  The file name of the rule's target.
• $<  The name of the first prerequisite.
• $^  The names of all prerequisites, separated by spaces.
• $?  The names of all prerequisites that are newer than the target.

Listing 19-3 shows an example Makefile that uses all the knowledge from this tutorial.

Listing 19-3. Makefile_autovars

CC = gcc
CFLAGS = -std=c11 -Wall
LD = gcc

all: main

main: main.o lib.o
	$(LD) $^ -o $@

%.o: %.c %.h
	$(CC) $(CFLAGS) -c $< -o $@

clean:
	rm -f *.o main

.PHONY: clean


Assume the following project tree:

.
├── lib.c
├── lib.h
├── main.c
├── main.h
└── Makefile

0 directories, 5 files

A clean build with make will launch the following commands:

> make
gcc -std=c11 -Wall -c main.c -o main.o
gcc -std=c11 -Wall -c lib.c -o lib.o
gcc main.o lib.o -o main

Refer to the well-written GNU Make Manual [2] for further instructions.


CHAPTER 20

Appendix C. System Calls

Throughout this book we use a number of system calls. We gather the information about them in this appendix.

■■Note  It is always a good idea to read the man pages first, for example, man -s 2 write. The exact values of flags and parameters vary from system to system, so you should never use the numeric values directly. If you write in C, use the relevant headers (shown in the man pages for the system call of interest). If you write in assembly, you will have to either use LXR or another online service with indexed and annotated kernel code, or look into these C headers and create your own %define's accordingly.

The values provided here are valid for the following system:

> uname -a
Linux 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux

Issuing a system call in assembly is simple: just initialize the relevant registers with the correct parameter values (in any order) and execute the syscall instruction. If you need flags, you must define them yourself first; we provide you with their exact values. Remember that NASM can also compute constant expressions such as O_TRUNC|O_RDWR. Issuing a system call in C is usually done like calling a function whose declaration is provided in some include files.

■■Note  In C, never use flag values directly, such as 0x1000 in place of O_APPEND. Use the definitions provided in the header files: they are more readable and more portable. As we do not have corresponding header files in assembly, we have to define the flag values manually in the assembly files.

20.1 read

ssize_t read(int fd, void *buf, size_t count);

Description: Reads from a file descriptor.

rax   rdi      rsi         rdx            r10   r8   r9
0     int fd   void *buf   size_t count

Chapter 20 ■ Appendix C. System Calls

20.1.1 Arguments

1. fd     The file descriptor we read from: 0 for standard input; use the open system call to open a file by name.
2. buf    The address of the first byte in a sequence of bytes. The bytes received will be placed there.
3. count  Try to read this number of bytes.

Returns: rax = the number of bytes successfully read, -1 in case of error.

Includes for use in C: #include <unistd.h>

20.2 write

ssize_t write(int fd, const void *buf, size_t count);

Description: Writes to a file descriptor.

rax   rdi      rsi               rdx            r10   r8   r9
1     int fd   const void *buf   size_t count

20.2.1 Arguments

1. fd     The file descriptor we write to: 1 for standard output, 2 for standard error; use the open system call to open a file by name.
2. buf    The address of the first byte in the sequence of bytes to be written.
3. count  Try to write this number of bytes.

Returns: rax = the number of bytes successfully written, -1 in case of error.

Includes for use in C: #include <unistd.h>

20.3 open

int open(const char *pathname, int flags, mode_t mode);

Description: Opens a file with the given name (a null-terminated string).

rax   rdi                    rsi         rdx        r10   r8   r9
2     const char *pathname   int flags   int mode

20.3.1 Arguments

1. pathname  The name of the file to open (a null-terminated string).
2. flags     Flags, listed in section 20.3.2. They can be combined using |, for example, O_CREAT|O_WRONLY|O_TRUNC.
3. mode      An integer encoding the permissions for the user, the group, and everyone else. The permissions are similar to those used by the chmod command.

Returns: rax = a new file descriptor for the given file, -1 in case of error.

Includes for use in C:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

20.3.2 Flags

• O_APPEND = 0x1000   Append to the file on each write.
• O_CREAT = 0x40      Create a new file.
• O_TRUNC = 0x200     If the file already exists, is a regular file, and the access mode allows writing, it will be truncated to length 0.
• O_RDWR = 2          Read and write.
• O_WRONLY = 1        Write only.
• O_RDONLY = 0        Read only.

20.4 close

int close(int fd);

Description: Closes a file descriptor.

rax   rdi      rsi   rdx   r10   r8   r9
3     int fd

20.4.1 Arguments

1. fd  A valid file descriptor that should be closed.

Returns: rax = zero on success, -1 in case of error. The global variable errno holds the error code.

Includes for use in C: #include <unistd.h>

20.5 mmap

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

Description: Maps pages of the virtual address space to something. It can be anything hidden behind a "file" (devices, files on disk, etc.) or simply physical memory. In the latter case the pages are called anonymous: they do not correspond to anything present in the file system. Such pages hold the heap and the stacks of a process.

rax   rdi          rsi             rdx        r10         r8       r9
9     void *addr   size_t length   int prot   int flags   int fd   off_t offset

20.5.1 Arguments

1. addr    A hint for the starting virtual address of the freshly mapped region. We try to map the region at this address, and if we cannot, we let the operating system (OS) choose it. If it is 0, the address is always chosen by the OS.
2. length  The length of the mapped region in bytes.
3. prot    Protection flags (see section 20.5.2). They can be combined using |.
4. flags   Behavior flags (see section 20.5.3). They can be combined using |.
5. fd      A valid descriptor of the file to be mapped; ignored if the behavior flag MAP_ANONYMOUS is used.
6. offset  The starting offset in the file fd. We ignore all bytes before this offset and map the file starting from it. Ignored if the behavior flag MAP_ANONYMOUS is used.

Returns: rax = a pointer to the mapped region, -1 in case of error.

Includes for use in C: #include <sys/mman.h>


20.5.2 Protection Flags

• PROT_EXEC = 0x4    Pages can be executed.
• PROT_READ = 0x1    Pages can be read.
• PROT_WRITE = 0x2   Pages can be written.
• PROT_NONE = 0x0    Pages cannot be accessed.

20.5.3 Behavior Flags

• MAP_SHARED = 0x1      Pages can be shared between processes.
• MAP_PRIVATE = 0x2     Pages are not shared with other processes.
• MAP_ANONYMOUS = 0x20  Pages do not correspond to any file in the file system.
• MAP_FIXED = 0x10      Do not interpret addr as a hint but as an order. If the pages cannot be mapped starting at this address, fail.

■■Note  To use the MAP_ANONYMOUS flag, it may be necessary to define the _DEFAULT_SOURCE macro immediately before including the relevant header file, as follows:

#define _DEFAULT_SOURCE
#include <sys/mman.h>

20.6 munmap

int munmap(void *addr, size_t length);

Description: Unmaps a memory region of the given length. You can map a big region with mmap and then unmap a fraction of it with munmap.

rax   rdi          rsi             rdx   r10   r8   r9
11    void *addr   size_t length

20.6.1 Arguments

1. addr    The start of the region to unmap.
2. length  The length of the region to unmap.

Returns: rax = zero on success, -1 in case of error. The global variable errno holds the error code.

Includes for use in C: #include <sys/mman.h>

419

Chapter 20 ■ Appendix C. System Calls

20.7 exit

void _exit(int status);

Description: Terminates the process.

rax   rdi          rsi   rdx   r10   r8   r9
60    int status

20.7.1 Arguments

1. status  The exit code. The shell stores the exit code of the last launched program in the $? variable.

Returns nothing.

Includes for use in C: #include <unistd.h>


CHAPTER 21

Appendix D. Benchmark Information

All benchmarks were run on the following system:

> uname -a
Linux perseus 3.16-2-amd64 #1 SMP Debian 3.16.3-2 (2014-09-20) x86_64 GNU/Linux

> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 69
model name      : Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
stepping        : 1
microcode       : 0x1d
cpu MHz         : 2394.458
cache size      : 3072 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase smep

© Igor Zhirkov 2017 I. Zhirkov, Low Level Programming, DOI 10.1007/978-1-4842-2403-8_21


Chapter 21 ■ Appendix D. Performance Test Information

bogomips        : 4788.91
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 69
model name      : Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
stepping        : 1
microcode       : 0x1d
cpu MHz         : 2394.458
cache size      : 3072 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm ida arat epb pln pts dtherm fsgsbase smep
bogomips        : 4788.91
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

> cat /proc/meminfo
MemTotal:        1017348 kB
MemFree:          516672 kB
MemAvailable:     565600 kB
Buffers:           32756 kB
Cached:           114944 kB
SwapCached:        10044 kB
Active:           376288 kB
Inactive:          49624 kB
Active(anon):     266428 kB
Inactive(anon):    12440 kB
Active(file):     109860 kB
Inactive(file):    37184 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        901116 kB
SwapFree:         868356 kB
Dirty:                44 kB
Writeback:             0 kB
AnonPages:        270964 kB
Mapped:            43852 kB
Shmem:               648 kB
Slab:              45980 kB
SReclaimable:      29016 kB
SUnreclaim:        16964 kB
KernelStack:        4192 kB
PageTables:         6100 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1409788 kB
Committed_AS:    1212356 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      145144 kB
VmallocChunk:   34359590172 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       49024 kB
DirectMap2M:      999424 kB
DirectMap1G:           0 kB


CHAPTER 22

Bibliography

[1] Debugging with gdb. Available: http://sourceware.org/gdb/current/onlinedocs/gdb/. 2017.
[2] GNU Make Manual. Available: www.gnu.org/software/make/manual/. 2016.
[3] How initialization functions are handled. Available: https://gcc.gnu.org/onlinedocs/gccint/Initialization.html. 2017.
[4] Using ld, the GNU linker. Available: www.math.utah.edu/docs/info/ld_3.html. 1994.
[5] What is map-reduce? Available: www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/. 2017.
[6] Jeff Andrews. Branch and loop reorganization to prevent mispredicts. Available: https://software.intel.com/en-us/articles/branch-and-loopreorganization-to-prevent-mispredicts. May 2011.
[7] C11 language standard: committee draft. www.open-std.org/jtc1/sc22/wg14/www/standards. April 2011.
[8] Luca Cardelli and Peter Wegner. On understanding types, data abstraction, and polymorphism. ACM Comput. Surv. 17(4):471–523. December 1985.
[9] Ryan A. Chapman. Linux 3.2.0-33, x86_64 syscall table. www.cs.utexas.edu/~bismith/test/syscalls/syscalls64_orig.html.
[10] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. New York: McGraw-Hill Higher Education, 2nd ed., 2001.
[11] Russ Cox. Regular expression matching can be simple and fast. https://swtch.com/~rsc/regexp/regexp1.html. November 2007.
[12] Ulrich Drepper. What every programmer should know about memory. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf. November 2007.
[13] Ulrich Drepper. How to write shared libraries. https://software.intel.com/sites/default/files/m/a/1/e/dsohowto.pdf. December 2011.
[14] Jens Gustedt. Myth and reality about inline in C99. https://gustedt.wordpress.com/2010/11/29/myth-and-reality-about-inline-in-c99/. 2010.


Chapter 22 ■ Bibliography

[15] Intel Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Available: www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf. September 2014.
[16] Intel Corporation. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Available: www.intel.com/content/www/us/en/architecture-andtechnology/64-ia-32-architectures-optimization-manual.html. June 2016.
[17] David Kanter. Intel's Haswell CPU microarchitecture. Available: www.realworldtech.com/haswell-cpu/1.
[18] Brian W. Kernighan. The C Programming Language. Prentice Hall Professional Technical Reference, 2nd ed., 1988.
[19] Petter Larsson and Eric Palmer. Image processing acceleration techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions. January 2010.
[20] Doug Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.html. 2000.
[21] Michael E. Lee. Optimization of computer programs in C. Available: http://leto.net/docs/C-optimization.php.
[22] Chris Lomont. Fast inverse square root. Technical report, February 2003.
[23] Robert Love. Linux Kernel Development (Novell Press). Novell Press, 2005.
[24] Michael Matz, Jan Hubicka, Andreas Jaeger, and Mark Mitchell. System V application binary interface: AMD64 architecture processor supplement. Draft version 0.99.6, 2013.
[25] Paul E. McKenney. Memory barriers: a hardware view for software hackers. Linux Technology Center, IBM Beaverton, 2010.
[26] Pawel Moll. How do debuggers (really) work? Embedded Linux Conference Europe, October 2015. http://events.linuxfoundation.org/sites/events/files/slides/slides_16.pdf.
[27] The Netwide Assembler: NASM manual. Available: www.nasm.us/doc/.
[28] N.N. Nepeyvoda and I.N. Skopin. Foundations of Programming. RHD Moscow-Izhevsk, 2003.
[29] Benjamin C. Pierce. Types and Programming Languages. Cambridge, MA: MIT Press, 1st ed., 2002.
[30] Jeff Preshing. The purpose of memory_order_consume in C++11. http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/. 2014.
[31] Jeff Preshing. Weak vs. strong memory models. http://preshing.com/20120930/weak-vs-strong-memory-models/. 2012.
[32] Brad Rodriguez. Moving Forth: a series on writing Forth kernels. http://www.bradrodriguez.com/papers/moving1.html. The Computer Journal #59 (January/February 1993).


[33] Uresh Vahalia. UNIX Internals: The New Frontiers. Dorling Kindersley, 2008.
[34] Anthony Williams. C++ Concurrency in Action: Practical Multithreading. Shelter Island, NY: Manning, 2012.
[35] Glynn Winskel. The Formal Semantics of Programming Languages: An Introduction. Cambridge, MA: MIT Press, 1993.


Index

A Abstract Machines, 3 Abstract Syntax Tree, 230–231 Code Header File Access, 187–188 Addressing, 23–24, 30–31, 36 base-indexed with scaling and offset, 31 direct, 31 immediate, 23, 31 Indirection, 23 Address Space, 48 Address Space Layout Randomization (ASLR), 285 Alignment, 235–239 ​​​DeP and EXB Address Translation Process, 55 Page Table Entry, 55 PML4, 53 Segmentation Faults , 55 TLB, 55 Array, 147, 153 – 158, 160, 162–169, 179, 209–210 defined, 153 initializers, 154 memory allocators, 153 add functionality error, 163 const qualifiers, 166 code dictionary implementation assembly, 87–89 dynamic library, 307–308 GCC and NASM, 308 assembly language, 4 constant precalculations, 30 endianness, 28–29 function calls, 25–28 mov instruction, 20–21 syscall, 20–21 label , 19 output register local labels, 23 rax value, 22–23

relative addressing, 23–24 pointer, 30 string length calculation, 32–33 strings, 29 assembly preprocessor, 64–73 conditionals, 66–69 %define, 64–67, 69–70 macros with arguments, 65 –66, 68– 69 declarations, 251

B Backus-Naur Form (BNF), 222 Binutils, 77 BitMaP Format (BMP), 256–257 BNF (Backus-Naur Form), 222 Booleans, 150 Branch Prediction, 338 Breakpoint, 399

C C C89, 130 compilation, 130 Fibonacci series control flow, 138 for, 135–136 switch, 137–138 while, 135 hovering over, 134 data types, 132–133 Duff device, 137 expressions, 139, 2 functions, 142–144 main function, 156–157 preprocessor, 144–145 block, 143–144 #define, 144–145 #endif, 144 #ifndef, 144 #include, 144–145



C (cont.) program structure, 130–132 instructions, 139–142 block, 131, 140 C11, 238–239 alignment, 238–239 Cache, 47 binary fetch, prefetch, 342–345 cache lines, 341 cache leak, 341 LL-cache, 341 array initialization, 346–348 memory, 341 memory bypass, 345 prefetch, 341 use case, 340 calling convention, 266–268 variable argument count ellipses, 271 Function 303–305 Chomsky Hierarchy, 229–230 CISC, 45 Atomic C11 memory model, 391 Intel 64, 390 memory requests, 392 operations, 392–393 kernel code models, 316 Large PIC, 320–322 Sin PIC, 318 medium, 319 PIC, 322–323 no PIC, 318 –319 small PIC, 319 no PIC, 317 code reuse, 255 coding style characteristics, 241–242 file structure, 243 functions, 246 integer types, 241 naming , 242 Types, 243–244 Variables, 244–245 Build process preprocessor (see preprocessor), 74 Compiler, 64 Condition variables, 379, 381 const types, 158–160 Context-free grammars, 229 Context-sensitive grammars, 229


D
Data Execution Prevention (DEP), 288
Data models, 213–214
Data streams, 215–217
Data structure padding, 235–238, 244
Data types, static: strong typing, 172; weak typing, 172–173
Deadlocks, 377–378
Debugging, 399
Forward declarations, 182: functions, 182–183; incomplete type, 183; structure, 183
Descriptor privilege level, 41–42
Directive, 19
Distributed factorization, 370, 372–374
Dynamic arrays, scanf, 195
Dynamic library, 293–294: function call, 303–305; .dynstr, 83; .dynsym, 83; .hash, 83; optimizations, 313–315
Dynamic linker, 305–306
Dynamic memory allocation, 195

ELF (Executable and Linkable Format), 74–76, 291 file type, 74, 76, 89 headers, 75 .dynstr, 83 .dynsym, 83–84 .hash, 83–84 execution view, 76 link view , 75– 76 Program header table, 76 section controls, 76 sections, 76 .bss, 76 .data, 76, 293, 295, 307, 313 .debug, 76 .dynstr, 83 .dynsym, 83 run view , 76 .fini, 311 . hash, 83 .init, 311–312 .line, 75–76, 317, 403 link view, 75–76

■ TABLE OF CONTENTS

.rel.data, 76 .rel.text, 76 .rodata, 76, 83–84, 207, 211–212, 296 .strtab, 76, 84 .symtab, 76, 84–85 .text, 76–79, 82 , 84–86 section table, 75 segments, 76, 86 structure, 75–76 Encapsulation, 248–251 Enumerations, 171 Error handling, 252–254, 256 Executable object file, 74, 80–81 Execution unit, 339–340 External variable, access, 306

F Fibonacci numbers, 138 File, 18 File descriptor, 215 Files and documentation, 246–247 File structure, 243 Finite state machines, 136–137 bit parity, 103 definition, 101 limitations, 105 regular expressions, 106–108 undefined behavior, 101 verification techniques, 106 Forbidden addresses, 50–52 Formal grammars, 221–231 arithmetic, 224, 227–228 arithmetic with priorities, 227–228 Chomsky hierarchy, 229–230 imperative language, 229 natural numbers, 223 nonterminals, 222–227 recursive descent, 224–227 symbols, 222 terminals, 222 Forth machine architecture, 109–111 bootstrapping, 123–124 compilation, 121–122 compiler, 117 conditionals, 117 dictionary, 112 execution token, 113–114 immediate flag, 117 immediate words, 117 indirect threaded code, 112–113, 115 inner interpreter, 114, 123 native words, 110, 112–114 outer interpreter, 123

PC, 112, 114–115, 117 PC register, 112 pseudocode, 120 quadratic equation solver, 111 static dictionary, 118–121 word list, 120–121 words, implementation, 112 W register, 112 Forward declaration, 182 Function, 142–144, 246 Function types, 160–161 Function call sequence callee-saved registers, 26 caller-saved registers, 26 calling convention, 266–268 red zone, 271 return address, 25 return value, 26 system calls, 27 variable argument count, 271–273 vprintf, 273 XMM registers, 265 Function prototypes, 182

G gdb autocompletion, 400 breakpoints, 405 call stack, 404–406 commands, 401 FMT, 402 -ggdb, 403 Intel syntax, 400 rax register, 399–400 stack, 404 Global Descriptor Table (GDT), 41, 44, 94, 98 Global Offset Table (GOT), 291, 294, 295 Good code practices, 241–262

H Header files, 187–188 Heap, 49 Higher-order functions, 217–219

I, J Image rotation architecture, 109 BMP (BitMaP) format, 256–257 Immutability, 251 Implicit conversions, 150–151 Include guard, 192–193 Incomplete type, 183, 249



Indirect threaded code, 112, 113, 115 Inline, 280–281 Input/output (I/O) ports tr register, 92–93 Instruction cache, 47 Instruction decoder, 46 Integer promotion, 150 Intel 64 architecture, 6 errors, 387 restrictions, 385 main function, 387 reordering, 385–386 Interrupts #BP, 96 descriptor, 94 error code, 96 #GP, 96 IDTR register, 94, 96 interrupt descriptor, 94, 96 interrupt descriptor table (IDT), 94 interrupt handler, 94 interrupt stack table, 93 iretq instruction, 97 #PF, 96 Intermediate representation (IR), 74

K Kernel code model, 316

L Lazy memory allocation, 274 LD_PRELOAD, 192, 197, 301–302 Linked objects, 310–313 Linker, 64 symbol, 188 Linking, 74–77 alignment, 78, 86 libraries (see Dynamic library) Livelocks, 378–379 Loader, 64, 85–87 Locality of reference, 8 Lock-free programming, 388–390 Logical address, 49 Long mode, 44


threading, 44 longjmp, 276–277, 280. See also setjmp lvalue, 140 expression, 140, 146 statement, 140, 146

M Machine word, 20 Macro expansion, 64 Macro instantiation, 64 Macros, 64–65 Main function, 156–157 Makefile, 246, 261, 409–410 automatic variables, 412–413 malloc implementation, 328 Map-reduce technique, 348 Memory allocation, 254 automatic, 207 dynamic, 207 static, 207 Memory allocator, 259–261 Memory barrier, 363–364 Memory leak, 208 Memory management unit (MMU), 49 memory allocation, 50 null-terminated string, 59 Memory model dynamic allocation, 207 memory leak, 208 void pointer (void*), 208 memory regions, 206–207 Models of computation, 101–126 Model-specific registers (MSRs), 97 Module, 74, 76 Multithreading execution order, 358–359 memory barrier, 363–364 POSIX threads (see Pthreads), 357 reordering, 360–362 strong and weak memory models, 359–360 threads, 357 use cases, 365 volatile, 362–363 Mutexes, 374–376

N Namespaces, 168 Naming conventions, 242 Natural numbers, 223 Nondeterministic finite automaton (NFA), 107 Numeric types, 147–149


O Object file, 74–82, 88 sizeof operator, 157–158 Optimizations, 313–315 branch prediction, 338 compiler flags, 329 constant propagation, 334–335 execution units, 339–340 grouping reads and writes, 340 low-level, 329 performance tests, 327 profiler, 328–329 return value, 336–338 omitting the stack frame pointer, 329–330 common subexpression elimination, 333–334 tail recursion, 330–333

P, Q Parser combinators, 227 Parsing complex declarations, 211 Physical address, 48, 52, 53, 55 Pointer arithmetic, 202–203 Pointers, 151–152 array, 210 function, 205 NULL, 203–204 ptrdiff_t, 203, 205 void*, 203 Polymorphism casts, 179 definition, 174 inclusion, 177–178 overloading, 178–179 parametric, 175–177 Position-independent code (PIC), 82, 293 Pragmatics, 235–238 alignment, 235 data structure padding, 235–238 Preloading, 301–302 Preprocessor, 64, 144–145 conditions argument type, 68–69 definedness, 67 text identity, 67–68 %define, 64 #define directive, 190 evaluation order, 69–70 #ifdef, 191 include guard, 192–193 labels inside macros, 72–73 pitfalls, 194 #pragma once, 192 repetition, 70–71

substitutions with arguments, 65 conditionals, 66 macros, 64–65 symbols, 64 Prime numbers, 71–72, 138, 167 Prime number checker, 167 Procedure, 142–143 Programming language, 221 Protected mode, 42–43 far jump, 42 GDT/LDT, 41 RPL, 41 segment descriptor, 42 segment selector, 41 Protection rings, 14, 39, 41, 43, 44, 46, 92, 93 #BP, 96 #GP, 96 #PF, 96 #UD, 96 Pthreads, 379 condition variables, 381 deadlocks, 377–378 distributed factorization, 370–374 joinable threads, 370 livelocks, 378–379 multithreading, use cases, 365 mutexes, 374–376 semaphores, 382, 384 attributes, setting, 368–369 spinlocks, 381–382 synchronization, 369 threads, creation, 366–368 ptrdiff_t, 204

R Real mode, 39–40 segments, 39 segment registers, 39, 40 Reduced instruction set computer (RISC), 45–46 Registers, 8 callee-saved, 26, 33, 266, 321 CISC and RISC, 45 instruction decoder, 46 locality of reference, 8 rax, 12, 416 rflags, 12 rip, 12 rsi and rdi, 13 rsp and rbp, 13 segment registers, 13 Regular expressions, 222, 229, 231 Relocatable object files, 74, 76–80



Relocations, 74–76, 82–83, 291–293, 304 Relocation table, 75–76, 79, 81, 86 Reentrancy, 245 restrict, 281–282 rvalue, 140. See also lvalue

S scanf, 195–197 Scalar product, 166 Security address space layout randomization, 288 DEP, 288 format output functions, 285–287 return-to-libc, 285 stack buffer overflow, 284 Security cookie, 287–288 Segmentation fault, 49, 52, 55 Segment selector, 41 Semantics, 231–234 implementation-defined behavior, 234 sequence points, 234 undefined behavior, 232–233 unspecified behavior, 233 Semaphores, 382, 384 setjmp, 276–280. See also longjmp longjmp, 276 volatile, 277–279 Shadow registers, 43, 92, 98 Shared object files, 75, 81 lookup scope, 292 Shellcode, 285 SIMD instruction class, 348 Signal, 49 SIGSEGV, 49 sizeof, 150–151, 157, 169, 179 Source file structure, 246 Spatial locality, 8 Speedup, 374 Spinlocks, 381–382 SSE and AVX extensions, 349 addps and movdqa, 350 ALU, 351 AVX, 351 movdqa and mulps, 350 packing, 350 sepia filter, 351–354 Standard library, 188–190 Statements, 140–141 static, 189, 194, 199 static keyword, 198–199 Static libraries, 81, 82 stderr, 18 stdin, 18 stdout, 18 Strict aliasing rule, 283


String interning, 213 String length computation, 32–33 String literals, 211–213 Strings, null-terminated, 29, 30, 32, 35, 160 Structures, 167–169 Symbol resolution, 75 Symbol table, 76–80, 85 Syntax, 221–231 abstract, 230, 231 concrete, 230 sysret, 97–99 System calls, 18, 97–99 close, 417–418 exit, 420 mmap, 56, 57 model-specific registers, 97 munmap, 419 open, 56, 59 read, 415 write, 18, 20

T Task State Segment (TSS), 91–93 Temporal locality, 8 Tokens, 231 Translation, 74 Translation Lookaside Buffer (TLB), 210 Type aliases, 155–156 Type conversions, 149–150 typedef, 211 Type system Booleans, 150 implicit conversions, 150–151 numeric types, 147–149 pointers, 151–152 type conversions, 149–150 Typing, 172 dynamic typing, 173 explicit typing, 172 implicit typing, 172 static typing, 172 weak typing, 173

U Undefined behavior, 101 Unions, 169–171

V, W Verification, 106 Virtual address, 49 Canonical address, 53 Virtual memory


address space, 48 address translation process, 55 PML4, 53 allocation, 50 bus error, 53 cache, 47 efficiency, 52 forbidden addresses, 50–51 memory allocation (see Memory allocation) memory map, 50 pages, 49 anonymous pages, 49, 50, 56 frames, 53 sizes, 56 regions, 50 replacement strategies, 50 swap file, 49 virtual address structure, 53 working set, 47


volatile, 274 GCC volatile variables, 275 pointers, 274 von Neumann architecture advantages, 4 assembly language, 4 extensions, 7 hardware stack, 6–7 interrupts, 6–7 memory, 4–5 protection rings, 6–7 registers, 6–7 virtual memory, 6, 7

X, Y, Z XMM registers, 265

