第23天：性能优化

学习目标

学习C++性能优化技巧，掌握编译器优化、内存优化和算法优化等方法，提高程序运行效率。

学习资源链接

📚 官方文档和教程

C++ Optimization Guide - C++优化指南(Agner Fog)
Intel C++ Optimization Guide - Intel C++优化指南
GCC Optimization Options - GCC编译器优化选项
Clang Optimization - Clang优化选项

🎥 视频教程

C++ Performance Optimization - C++性能优化技巧
The Cherno - C++ Performance - 性能优化实战
CppCon Performance Talks - CppCon性能优化演讲

📖 深入阅读

Optimized C++ - C++优化技术
High Performance Computing - 高性能计算
Computer Systems: A Programmer’s Perspective - 系统级性能优化

🔧 性能分析工具

Intel VTune Profiler - 性能分析工具
Valgrind - 内存分析和性能分析
Google Benchmark - C++基准测试库
perf - Linux性能分析工具

学习内容

1. 编译器优化选项

优化级别：-O0, -O1, -O2, -O3, -Os, -Ofast
特定优化：-finline-functions, -funroll-loops, -ffast-math
调试友好优化：-Og
链接时优化：-flto (Link Time Optimization)

# 基本优化编译
g++ -O2 -std=c++17 program.cpp -o program

# 最大优化（可能影响调试）
g++ -O3 -DNDEBUG -march=native program.cpp -o program

# 链接时优化
g++ -O3 -flto program.cpp -o program

2. 内联函数优化

内联函数的作用：消除函数调用开销
编译器内联决策：函数大小、调用频率
强制内联：__forceinline (MSVC), __attribute__((always_inline)) (GCC)

// 内联函数示例
class Vector3 {
private:
    float x, y, z;
    
public:
    // 简单函数适合内联
    inline float dot(const Vector3& other) const {
        return x * other.x + y * other.y + z * other.z;
    }
    
    // 复杂函数不适合内联
    Vector3 normalize() const; // 在.cpp中实现
};

// 模板函数自动内联
template<typename T>
constexpr T max(const T& a, const T& b) {
    return (a > b) ? a : b;
}

3. 缓存友好的代码

数据局部性：时间局部性和空间局部性
缓存行对齐：避免false sharing
数据结构优化：AoS vs SoA (Array of Structures vs Structure of Arrays)

// 缓存友好的矩阵乘法
void matrix_multiply_optimized(const std::vector<std::vector<double>>& A,
                              const std::vector<std::vector<double>>& B,
                              std::vector<std::vector<double>>& C) {
    size_t n = A.size();
    
    // 分块矩阵乘法，提高缓存命中率
    const size_t BLOCK_SIZE = 64;
    
    for (size_t bi = 0; bi < n; bi += BLOCK_SIZE) {
        for (size_t bj = 0; bj < n; bj += BLOCK_SIZE) {
            for (size_t bk = 0; bk < n; bk += BLOCK_SIZE) {
                // 块内计算
                for (size_t i = bi; i < std::min(bi + BLOCK_SIZE, n); ++i) {
                    for (size_t j = bj; j < std::min(bj + BLOCK_SIZE, n); ++j) {
                        for (size_t k = bk; k < std::min(bk + BLOCK_SIZE, n); ++k) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        }
    }
}

// SoA vs AoS 示例
// AoS (Array of Structures) - 缓存不友好
struct ParticleAoS {
    float x, y, z;    // 位置
    float vx, vy, vz; // 速度
    float mass;       // 质量
    int id;          // ID
};
std::vector<ParticleAoS> particles_aos;

// SoA (Structure of Arrays) - 缓存友好
struct ParticlesSoA {
    std::vector<float> x, y, z;        // 位置
    std::vector<float> vx, vy, vz;     // 速度
    std::vector<float> mass;           // 质量
    std::vector<int> id;               // ID
};

4. 避免不必要的拷贝

移动语义：使用std::move
引用传递：避免参数拷贝
返回值优化：RVO和NRVO
原地构造：emplace_back vs push_back

class LargeObject {
private:
    std::vector<int> data;
    
public:
    // 移动构造函数
    LargeObject(LargeObject&& other) noexcept 
        : data(std::move(other.data)) {}
    
    // 移动赋值运算符
    LargeObject& operator=(LargeObject&& other) noexcept {
        if (this != &other) {
            data = std::move(other.data);
        }
        return *this;
    }
};

// 避免拷贝的函数设计
class DataProcessor {
public:
    // 使用const引用避免拷贝
    void process(const std::vector<int>& input) {
        // 处理数据...
    }
    
    // 返回值优化 (RVO)
    std::vector<int> createData() {
        std::vector<int> result;
        // 填充数据...
        return result; // 编译器会优化掉拷贝
    }
    
    // 原地构造
    void addItems(std::vector<LargeObject>& container) {
        // 好的做法：原地构造
        container.emplace_back(/* 构造参数 */);
        
        // 避免：先构造再拷贝
        // container.push_back(LargeObject(/* 参数 */));
    }
};

5. 内存池技术

内存池概念：预分配内存块，减少malloc/free调用
对象池：重用对象实例
栈分配器：线性内存分配

// 简单内存池实现
template<typename T, size_t PoolSize>
class MemoryPool {
private:
    alignas(T) char pool[PoolSize * sizeof(T)];
    std::bitset<PoolSize> used;
    size_t next_free = 0;
    
public:
    template<typename... Args>
    T* allocate(Args&&... args) {
        // 找到空闲位置
        for (size_t i = next_free; i < PoolSize; ++i) {
            if (!used[i]) {
                used[i] = true;
                next_free = i + 1;
                return new(pool + i * sizeof(T)) T(std::forward<Args>(args)...);
            }
        }
        
        // 从头开始查找
        for (size_t i = 0; i < next_free; ++i) {
            if (!used[i]) {
                used[i] = true;
                next_free = i + 1;
                return new(pool + i * sizeof(T)) T(std::forward<Args>(args)...);
            }
        }
        
        return nullptr; // 池已满
    }
    
    void deallocate(T* ptr) {
        if (ptr >= reinterpret_cast<T*>(pool) && 
            ptr < reinterpret_cast<T*>(pool + sizeof(pool))) {
            ptr->~T();
            size_t index = (reinterpret_cast<char*>(ptr) - pool) / sizeof(T);
            used[index] = false;
            next_free = std::min(next_free, index);
        }
    }
};

// 使用示例
MemoryPool<std::string, 1000> string_pool;
auto* str = string_pool.allocate("Hello, World!");
// 使用字符串...
string_pool.deallocate(str);

6. 算法优化技巧

选择合适的数据结构：vector vs list vs deque
算法复杂度优化：减少时间复杂度
并行算法：使用std::execution policy

#include <execution>
#include <algorithm>
#include <numeric>

void optimization_examples() {
    std::vector<int> data(1000000);
    std::iota(data.begin(), data.end(), 1);
    
    // 并行排序
    std::sort(std::execution::par_unseq, data.begin(), data.end());
    
    // 并行累加
    auto sum = std::reduce(std::execution::par_unseq, 
                          data.begin(), data.end(), 0);
    
    // 并行变换
    std::transform(std::execution::par_unseq,
                   data.begin(), data.end(), data.begin(),
                   [](int x) { return x * x; });
}

// 数据结构选择示例
class PerformanceComparison {
public:
    // 顺序访问：vector最优
    void sequential_access() {
        std::vector<int> vec(10000);
        for (auto& item : vec) {
            item *= 2; // 缓存友好
        }
    }
    
    // 频繁插入删除：list可能更好
    void frequent_insertions() {
        std::list<int> lst;
        for (int i = 0; i < 1000; ++i) {
            lst.insert(std::next(lst.begin(), i/2), i);
        }
    }
    
    // 查找操作：set/map更优
    void frequent_lookups() {
        std::unordered_set<int> set;
        // O(1) 平均查找时间
        if (set.find(42) != set.end()) {
            // 找到元素
        }
    }
};

性能测试和分析

基准测试示例

#include <chrono>
#include <iostream>

class Timer {
private:
    std::chrono::high_resolution_clock::time_point start_time;
    
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) {}
    
    ~Timer() {
        auto end_time = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>
                       (end_time - start_time);
        std::cout << "Time taken: " << duration.count() << " microseconds\n";
    }
};

void benchmark_example() {
    std::vector<int> data(1000000);
    std::iota(data.begin(), data.end(), 1);
    
    {
        Timer t;
        std::sort(data.begin(), data.end());
    } // Timer析构时输出时间
}

实践练习

练习1：矩阵运算优化

优化矩阵乘法算法：

实现分块矩阵乘法
比较不同优化级别的性能
使用性能分析工具测量

练习2：字符串处理优化

优化字符串处理函数：

减少内存分配
使用string_view减少拷贝
实现自定义字符串池

练习3：容器性能比较

比较不同容器的性能：

vector vs list vs deque
不同操作模式下的性能差异
内存使用情况分析

练习4：并行算法应用

使用并行算法优化计算密集型任务：

并行排序
并行数值计算
并行搜索算法

今日总结

通过学习性能优化，你应该掌握：

编译器优化选项的使用
缓存友好代码的编写技巧
内存管理和对象生命周期优化
算法和数据结构的性能考虑
性能测试和分析方法

明天预告

明天我们将学习调试与测试技巧，包括GDB调试器、内存检查工具和单元测试框架的使用。

返回第四周 | 上一天：第22天 | 下一天：第24天

C++ 学习计划 - 第23天:性能优化

第23天：性能优化

学习目标

学习资源链接

📚 官方文档和教程

🎥 视频教程

📖 深入阅读

🔧 性能分析工具

学习内容

1. 编译器优化选项

2. 内联函数优化

3. 缓存友好的代码

4. 避免不必要的拷贝

5. 内存池技术

6. 算法优化技巧

性能测试和分析

基准测试示例

实践练习

练习1：矩阵运算优化

练习2：字符串处理优化

练习3：容器性能比较

练习4：并行算法应用

今日总结

明天预告