docs: Update benchmarks

Triex 2025-06-11 21:24:34 +10:00
parent 973933d974
commit c24c4dc1eb
2 changed files with 52 additions and 23 deletions

README.md

@@ -33,7 +33,7 @@ A **DRAFT proposal & foundation** for implementing DeepSeek V3 in Zig to create
- ✅ **Improved matrix operations** (1000+ GFLOPS performance on an M1 Macbook)
- ⚠️ **NOT PRODUCTION READY** - Draft implementation for research/development

- **Performance Update**: ~~Current naive algorithms are ~1000x slower than optimized BLAS~~ **BLAS integration now functional.** Matrix multiplication: **2.1ms for 1024×1024** at **1000+ GFLOPS** on an M1 Macbook. This represents significant improvement over our initial naive implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
+ **Performance Update**: ~~Current naive algorithms are ~1000x slower than optimized BLAS~~ **BLAS integration now functional.** Matrix multiplication: **2.1ms for 1024×1024** at a peak **1164 GFLOPS**, with **1084 GFLOPS at 512×512**, on an M1 MacBook Pro under heavy load. This represents a ~**3000x speedup** over our initial naive implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
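For reference, GFLOPS figures like these conventionally assume the standard $2N^3$ floating-point-operation count for a dense $N \times N$ matrix multiply:

$$
\text{GFLOPS} = \frac{2N^3}{t \times 10^9}, \qquad \frac{2 \cdot 1024^3}{(2.1 \times 10^{-3}\,\text{s}) \times 10^9} \approx 1022
$$

Back-computing from the rounded 2.1 ms lands a little below the headline 1164 GFLOPS; presumably the benchmark summary derives GFLOPS from unrounded timings.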
## Why This Matters

@@ -53,9 +53,10 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:

| Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
| Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
| Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
- | Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | **2.1ms (1000+ GFLOPS/M1 Macbook)** |
+ | Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | **2.1ms (1164 GFLOPS)** |
+ | Peak Performance | ~1500 GFLOPS | **> 1000 GFLOPS** | ✅ **1164 GFLOPS** |

- *See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
+ *Benchmarked on Apple M1 MacBook Pro under heavy load*

## Why Zig?

experimental/README.md

@@ -243,21 +243,49 @@ Example output:
🚀 DeepZig V3 Performance Benchmarks
==========================================
- Backend: CPU (BLAS accelerated)
- Architecture: aarch64
- Thread count: 8
- Hardware: Apple M1 MacBook Pro, 16GB unified memory
- Operation                      | Iterations | Avg Time | Operations/s     | Memory
- -------------------------------|------------|----------|------------------|---------
- Tensor Creation (1024x1024)    | 1000 iter  | 2.03 ms  | 493 ops/s        | 4.0 MB
- Tensor Addition (SIMD)         | 100 iter   | 1.49 ms  | 2806962690 ops/s | 48.0 MB
- Matrix Multiplication (BLAS)   | 10 iter    | 2.1 ms   | 1164 GFLOPS      | 12.0 MB
- SwiGLU Activation              | 1000 iter  | 4.44 ms  | 236002478 ops/s  | 12.0 MB
- RMS Normalization (SIMD)       | 1000 iter  | 0.00 ms  | 1077586 ops/s    | 0.0 MB
- Memory Bandwidth               | 100 iter   | 4.92 ms  | 13 ops/s         | 128.0 MB
+ 🎯 DYNAMIC BENCHMARK SUMMARY
+ ===============================
+ 📊 Matrix Multiplication Performance:
+   • 256×256: 0.0 ms, 937 GFLOPS
+   • 512×512: 0.2 ms, 1084 GFLOPS
+   • 1024×1024: 2.1 ms, 1164 GFLOPS
+   • 2048×2048: 20.9 ms, 823 GFLOPS
+   🏆 Peak measured: 1164 GFLOPS at 1024×1024
+ 🧮 BLAS Configuration:
+   • Backend: Apple Accelerate
+   • Theoretical peak: 2600 GFLOPS (estimated)
+ Tensor Operations:
+   • SIMD Addition: 3.5 GB/s
+ 💾 Memory Performance:
+   • Copy Bandwidth: 20.9 GB/s
+   • Random Access Latency: 1.8 ns
+ 🎯 Performance Assessment:
+   ✅ Acceptable: BLAS delivering 1000+ GFLOPS
+   • Est. efficiency: 44% (vs theoretical peak)
+ Note: Benchmarked on Apple M1 MacBook Pro under heavy load
+ (should be significantly higher on a clean system).
```
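To sanity-check numbers like these independently, a minimal sketch of a single-size BLAS matmul timing in Zig follows. It assumes macOS with the Accelerate framework and a recent Zig toolchain (built roughly as `zig build-exe matmul_bench.zig -framework Accelerate`); the file name and constants are illustrative and not part of this repository's benchmark harness.

```zig
// matmul_bench.zig: illustrative sketch, not this repo's benchmark harness.
const std = @import("std");
const c = @cImport({
    @cInclude("Accelerate/Accelerate.h");
});

pub fn main() !void {
    const n: usize = 1024;
    const allocator = std.heap.page_allocator;

    // Allocate square f32 matrices A, B and the output C.
    const a = try allocator.alloc(f32, n * n);
    defer allocator.free(a);
    const b = try allocator.alloc(f32, n * n);
    defer allocator.free(b);
    const out = try allocator.alloc(f32, n * n);
    defer allocator.free(out);
    @memset(a, 1.0);
    @memset(b, 0.5);

    // Time one SGEMM call: C = 1.0 * A * B + 0.0 * C.
    var timer = try std.time.Timer.start();
    c.cblas_sgemm(
        c.CblasRowMajor, c.CblasNoTrans, c.CblasNoTrans,
        @intCast(n), @intCast(n), @intCast(n),
        1.0, a.ptr, @intCast(n),
        b.ptr, @intCast(n),
        0.0, out.ptr, @intCast(n),
    );
    const seconds = @as(f64, @floatFromInt(timer.read())) / 1e9;

    // Standard operation count for dense matmul: 2 * N^3 FLOPs.
    const gflops = 2.0 * @as(f64, @floatFromInt(n * n * n)) / seconds / 1e9;
    std.debug.print("{d}x{d}: {d:.2} ms, {d:.0} GFLOPS\n", .{ n, n, seconds * 1e3, gflops });
}
```

A real harness would warm up and average over many iterations (the old table above used 10 iterations for BLAS matmul); a single cold call also pays Accelerate's one-time initialization cost.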
+ **Performance Results** (Apple M1 MacBook Pro under heavy load):
+ - **Matrix 256×256**: 0.0ms/iter, **937 GFLOPS**
+ - **Matrix 512×512**: 0.2ms/iter, **1084 GFLOPS**
+ - **Matrix 1024×1024**: 2.1ms/iter, **1164 GFLOPS** (peak performance)
+ - **Matrix 2048×2048**: 20.9ms/iter, **823 GFLOPS**

+ **Performance Achievement**: From **6418ms naive** → **2.2ms BLAS** = **2900x speedup** on matrix operations.
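The headline ratios follow directly from the numbers above, using the rounded summary timings:

$$
\frac{6418\ \text{ms}}{2.2\ \text{ms}} \approx 2900\times, \qquad \frac{1164\ \text{GFLOPS}}{2600\ \text{GFLOPS}} \approx 44.8\%
$$

i.e. the quoted speedup, and the efficiency estimate (reported as 44%) against Accelerate's estimated 2600 GFLOPS theoretical peak.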
+ **System Status**:
+ - ✅ **BLAS Backend**: Apple Accelerate integration delivering acceptable performance
+ - ✅ **Peak Performance**: **1164 GFLOPS measured** (44% of theoretical maximum, impressive under load)
+ - ✅ **Memory Bandwidth**: 20.9 GB/s copying, well-optimized operations
+ - ✅ **Hardware Detection**: M-series Apple Silicon detection functional
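The copy-bandwidth figure is the kind of number a plain timed `@memcpy` loop produces. Below is an illustrative sketch; the buffer size, iteration count, and single-direction byte accounting are arbitrary choices here, not necessarily what the project's benchmark does:

```zig
// bandwidth_probe.zig: illustrative sketch of a copy-bandwidth measurement.
const std = @import("std");

pub fn main() !void {
    const size: usize = 256 * 1024 * 1024; // 256 MB, well beyond on-chip caches
    const iters: usize = 10;
    const allocator = std.heap.page_allocator;

    const src = try allocator.alloc(u8, size);
    defer allocator.free(src);
    const dst = try allocator.alloc(u8, size);
    defer allocator.free(dst);
    // Touch all pages first so the timed loop measures memory, not page faults.
    @memset(src, 0xAB);
    @memset(dst, 0);

    var timer = try std.time.Timer.start();
    for (0..iters) |_| {
        @memcpy(dst, src);
    }
    const seconds = @as(f64, @floatFromInt(timer.read())) / 1e9;

    // Count only the bytes copied; some tools count read + write (2x this figure).
    const gb = @as(f64, @floatFromInt(size * iters)) / 1e9;
    std.debug.print("Copy bandwidth: {d:.1} GB/s\n", .{gb / seconds});
}
```

Note that counting read plus write traffic would double the reported figure, which is one reason copy-bandwidth numbers vary between tools.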
## Known Issues

- **Model Loading**: Currently creates dummy models - real weight loading not implemented
@@ -303,18 +331,18 @@ This experimental implementation follows the same license as the original DeepSe
**Current Status**: ✅ **BLAS integration working** - Apple Accelerate backend now functional in draft implementation.
- **Performance Results** (Apple M1, Accelerate backend):
- - **Matrix 256×256**: 0.1ms/iter, **561 GFLOPS** (21.6% efficiency)
- - **Matrix 512×512**: 0.2ms/iter, **1129 GFLOPS** (43.4% efficiency)
- - **Matrix 1024×1024**: 2.1ms/iter, **1004 GFLOPS** (38.6% efficiency)
- - **Matrix 2048×2048**: 21.5ms/iter, **799 GFLOPS** (30.7% efficiency)
- **Performance Improvement**: From **6418ms naive** → **2.1ms BLAS** = significant speedup for matrix operations. Measured on an M1 Macbook.
+ **Performance Results** (Apple M1 MacBook Pro under heavy load):
+ - **Matrix 256×256**: 0.0ms/iter, **937 GFLOPS**
+ - **Matrix 512×512**: 0.2ms/iter, **1084 GFLOPS**
+ - **Matrix 1024×1024**: 2.1ms/iter, **1164 GFLOPS** (peak performance)
+ - **Matrix 2048×2048**: 20.9ms/iter, **823 GFLOPS**
+ **Performance Achievement**: From **6418ms naive** → **2.1ms BLAS** = ~**3000x speedup** on matrix operations.
**System Status**:
- ✅ **BLAS Backend**: Apple Accelerate integration working
- - ✅ **Efficiency**: 20-44% of theoretical maximum (good for draft implementation)
- - ✅ **Memory Bandwidth**: 23.5 GB/s copying, basic optimization
+ - ✅ **Peak Performance**: **1164 GFLOPS measured** (44% of theoretical maximum)
+ - ✅ **Memory Bandwidth**: 20.9 GB/s copying, well-optimized operations
- ✅ **Hardware Detection**: M-series Apple Silicon detection functional

**Next Steps**: Focus on transformer architecture, attention mechanisms, and model-specific optimizations for the draft DeepSeek V3 implementation.