mirror of https://github.com/deepseek-ai/DeepSeek-V3.git
synced 2025-07-05 07:51:38 -04:00

docs: Add clear device notes

This commit is contained in:
parent 618ecfb0c9
commit 973933d974
@@ -53,7 +53,7 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 | Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
 | Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
 | Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
-| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | ✅ **2.1ms (1000+ GFLOPS)** |
+| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | ✅ **2.1ms (1000+ GFLOPS/M1 Macbook)** |

 *See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
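The GFLOPS figure in the changed table row can be sanity-checked from first principles: an N×N matrix multiply performs 2·N³ floating-point operations, so 2.1 ms per iteration at N = 1024 works out to roughly 1000 GFLOPS. A minimal check in Python (timings taken from the row above):

```python
# Sanity-check the "1000+ GFLOPS" figure for a 1024x1024 matmul at 2.1 ms/iter.
def gflops(n: int, seconds: float) -> float:
    """Throughput of an n x n matrix multiply: 2*n^3 flops over `seconds`."""
    return 2 * n**3 / seconds / 1e9

print(f"{gflops(1024, 2.1e-3):.0f} GFLOPS")  # ~1023 GFLOPS, matching "1000+"
```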
@@ -103,7 +103,7 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 - [x] **Updated to Zig 0.15.0-dev - compiles cleanly**
 - [x] **Benchmark suite** showing current performance
 - [x] **BLAS integration working** - Apple Accelerate backend functional
-- [x] **Improved matrix performance** - 1000+ GFLOPS operations
+- [x] **Improved matrix performance** - 1000+ GFLOPS operations on an M1 Macbook

 *📈 Performance improvement achieved - BLAS acceleration now working*
@@ -13,7 +13,7 @@ A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/)
 > - ✅ **Functional matrix operations** (significant performance improvement)
 >
 > **Recent Progress**: Matrix operations now use BLAS acceleration<br/>
-> **Performance Status**: 1160+ GFLOPS with Apple Accelerate backend working (measured on Apple M1)<br/>
+> **Performance Status**: 1160+ GFLOPS with Apple Accelerate backend working (measured on Apple M1 Macbook)<br/>
 >
 > See [Performance Results](#performance-notes) for detailed benchmarks.
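The GFLOPS numbers quoted in the status note come from repeated timed runs; the measurement itself reduces to a small harness like the following. This is a Python sketch of the idea, not the project's Zig benchmark code, and `bench_gflops` plus the toy workload are illustrative names:

```python
import time

def bench_gflops(fn, flops_per_call: int, iters: int = 10) -> float:
    """Average throughput of fn in GFLOPS, given its flop count per call."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    avg_seconds = (time.perf_counter() - start) / iters
    return flops_per_call / avg_seconds / 1e9

# Toy workload; a real matmul benchmark would pass flops_per_call = 2 * n**3.
rate = bench_gflops(lambda: sum(x * x for x in range(10_000)), flops_per_call=2 * 10_000)
print(f"{rate:.3f} GFLOPS")
```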
@@ -27,7 +27,7 @@ This experimental implementation aims to leverage Zig's unique advantages for sy
 - **Single binary deployment** with no runtime dependencies
 - **Cross-platform compilation** for multiple architectures

-**🚀 BLAS Acceleration Achieved!** We've successfully integrated Apple Accelerate backend delivering **1000+ GFLOPS** performance - a **3000x speedup** over the initial naive implementation.
+**🚀 BLAS Acceleration Achieved!** We've successfully integrated Apple Accelerate backend delivering **1000+ GFLOPS** performance - a **3000x speedup** over the initial naive implementation. Measured on an M1 Macbook.

 **🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.
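The "3000x speedup" in the note follows directly from the two timings the docs cite, 6418 ms for the naive implementation versus 2.1 ms with BLAS:

```python
# Speedup implied by the quoted timings: naive 6418 ms vs BLAS 2.1 ms per iteration.
naive_ms, blas_ms = 6418.0, 2.1
speedup = naive_ms / blas_ms
print(f"{speedup:.0f}x")  # ~3056x, i.e. the "3000x" quoted above
```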
@@ -309,7 +309,7 @@ This experimental implementation follows the same license as the original DeepSe
 - **Matrix 1024×1024**: 2.1ms/iter, **1004 GFLOPS** (38.6% efficiency)
 - **Matrix 2048×2048**: 21.5ms/iter, **799 GFLOPS** (30.7% efficiency)

-**Performance Improvement**: From **6418ms naive** → **2.1ms BLAS** = significant speedup for matrix operations
+**Performance Improvement**: From **6418ms naive** → **2.1ms BLAS** = significant speedup for matrix operations. Measured on an M1 Macbook.

 **System Status**:
 - ✅ **BLAS Backend**: Apple Accelerate integration working
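The efficiency percentages in the two benchmark rows are measured GFLOPS divided by a peak figure the docs do not state; back-solving from both rows gives a consistent implied peak of roughly 2.6 TFLOPS. The exact peak is an inference here, not given in the source:

```python
# Back-solve the peak throughput implied by the benchmark rows above:
# efficiency = measured / peak  =>  peak = measured / efficiency.
rows = [(1004, 0.386), (799, 0.307)]   # (GFLOPS, efficiency) per row
peaks = [g / e for g, e in rows]
print([round(p) for p in peaks])  # both near 2600 GFLOPS: one consistent peak
```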