Mirror of https://github.com/deepseek-ai/DeepSeek-V3.git (synced 2025-07-04 23:41:37 -04:00)

Merge pull request #2 from Triex:feat-Enhanced-device-detection-handling-initial-metal — enhanced device detection handling, initial Metal support

Commit 24d94f7c21
.gitignore (vendored)

@@ -172,4 +172,5 @@ cython_debug/
 .DS_Store
 
 # Zig
 experimental/.zig-cache/
+zig-out/
README.md (136 changed lines)
@@ -16,18 +16,22 @@
 </div>
 
 <hr />
 
-<h1 align="center"> DeepZig V3: A High-Performance LLM Architecture</h1>
+# DeepZig V3: A High-Performance LLM Architecture
 
 ## Overview
 
-A proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
+A **DRAFT proposal & foundation** for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
 
-**Status Update**: ✅ **Foundation compiles clean theoretical implementation** with Zig 0.15.0-dev, including:
-- HTTP server with modern Zig API
-- SIMD-optimized tensor operations
-- Cross-platform backend architecture
-- Initial memory management
-- Comprehensive build system draft
+**⚠️ Status: EXPERIMENTAL DRAFT** ✅ **Foundation compiles with Zig 0.15.0-dev**, including:
+- ✅ HTTP server framework (basic structure)
+- ✅ SIMD-optimized tensor operations (draft implementation)
+- ✅ Cross-platform backend architecture
+- ✅ Initial memory management
+- ✅ **Apple Silicon M-series detection** (hardware detection via sysctl)
+- ✅ Comprehensive build system draft
+- ⚠️ **NOT PRODUCTION READY** - Draft implementation for research/development
 
+**Performance Note**: Current naive algorithms are ~1000x slower than optimized BLAS. Matrix multiplication: 6418ms for 1024×1024. This is expected for a foundational draft implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
 
 ## Why This Matters
 
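To make the performance note above concrete, here is a minimal sketch of the naive O(n³) multiplication pattern it refers to (illustrative only — this is not the repository's actual kernel, and the function name is made up for the example). The inner loop walks a column of `b`, so nearly every access misses cache; that, plus the absence of tiling and SIMD accumulation, accounts for most of the ~1000x gap against blocked BLAS kernels:

```zig
const std = @import("std");

// Naive row-major matmul: c = a * b, all n x n. No tiling, no vector
// accumulation — the b[k * n + j] access strides by n floats each step.
fn naiveMatmul(a: []const f32, b: []const f32, c: []f32, n: usize) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f32 = 0.0;
            for (0..n) |k| {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

test "2x2 naive matmul" {
    const a = [_]f32{ 1, 2, 3, 4 };
    const b = [_]f32{ 5, 6, 7, 8 };
    var c = [_]f32{ 0, 0, 0, 0 };
    naiveMatmul(&a, &b, &c, 2);
    try std.testing.expectEqualSlices(f32, &[_]f32{ 19, 22, 43, 50 }, &c);
}
```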
@@ -37,6 +41,18 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 - **Complex deployment** with heavy runtimes
 - **Platform lock-in** due to dependency complexity
 
+## Expected Benefits vs Current Reality
+
+| Aspect | Current (PyTorch) | Target (Zig) | **Current Draft** |
+|--------|------------------|--------------|-------------------|
+| Cold start | 10-30s | **< 2s** | *Not measured* |
+| Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
+| Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
+| Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
+| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | *6418ms (naive)* |
+
+*See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
+
 ## Why Zig?
 
 **Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access<br/>
@@ -56,14 +72,14 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 └─────────────────┘ └──────────────────┘ └─────────────────┘
 ```
 
-## Proposed Web API
+## Draft Web API Framework
 
-### Target Endpoints
+### Planned Endpoints (Basic Structure Implemented)
 - `POST /v1/chat/completions` - OpenAI-compatible chat API
 - `POST /v1/completions` - Text completion
 - `GET /v1/models` - List available models
 - `GET /health` - Service health check
-- `WebSocket /ws` - Streaming inference
+- `WebSocket /ws` - Streaming inference (planned)
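For reference, a request to the planned chat endpoint would carry an OpenAI-style JSON body like the one below (a hypothetical example; the exact accepted fields are not final in this draft):

```zig
const std = @import("std");

pub fn main() void {
    // Hypothetical body for POST /v1/chat/completions; field names follow
    // the OpenAI chat-completions convention this API targets.
    const body =
        \\{"model": "deepseek-v3",
        \\ "messages": [{"role": "user", "content": "Hello!"}],
        \\ "stream": false}
    ;
    std.debug.print("POST /v1/chat/completions\n{s}\n", .{body});
}
```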
 
 ### Deployment Vision
 - **Static binaries** - Single file deployment, no dependencies
@@ -72,56 +88,55 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 - **Serverless functions** - Minimal cold start with static linking
 - **WebAssembly** - Browser inference without additional runtime
 
-## Implementation Plan
+## Implementation Plan Status
 
-### Phase 1: Foundation ✅ **DRAFTED**
+### Phase 1: Foundation ✅ **DRAFT COMPLETE**
 - [x] Set up Zig project structure
 - [x] Implement basic tensor operations with SIMD
 - [x] Create memory management system (arena allocators)
 - [x] Build HTTP server framework
+- [x] **Apple Silicon detection via sysctl calls**
 - [x] **Updated to Zig 0.15.0-dev - compiles cleanly**
+- [x] **Benchmark suite** showing current performance
 
-### Phase 2: Core Model
+*📈 Performance baseline established - see [benchmarks](experimental/README.md#benchmarks)*
+
+### Phase 2: Core Model (IN PROGRESS)
 - [ ] Implement transformer layers
 - [ ] Add Multi-Head Latent Attention (MLA)
 - [ ] Build Mixture of Experts (MoE) routing
 - [ ] Create tokenizer integration
 
-### Phase 3: Backends
+### Phase 3: Backends (PLANNED)
 - [ ] Optimize CPU backend with AVX/NEON
 - [ ] Integrate Metal for Apple Silicon
 - [ ] Add CUDA support for NVIDIA GPUs
 - [ ] Implement WebGPU for browsers
 
-### Phase 4: Web Integration
+### Phase 4: Web Integration (DRAFT STRUCTURE)
 - [x] Complete HTTP API implementation (basic structure)
 - [ ] Add WebSocket streaming
 - [ ] Build authentication/rate limiting
 - [ ] Create deployment tooling
 
-## Expected Benefits
-
-| Aspect | Current (PyTorch) | Proposed (Zig) |
-|--------|------------------|----------------|
-| Cold start | 10-30s | **< 2s** |
-| Memory usage | 20-40GB | **< 16GB** |
-| Dependencies | ~2GB runtime | **Single binary** |
-| Deployment | Complex | **Copy & run** |
-
 ## Technical Challenges
 
 - **Model Complexity**: DeepSeek V3's MoE architecture requires careful memory management
 - **Backend Integration**: Need efficient FFI to CUDA/Metal while maintaining performance
 - **Web Scale**: Handle concurrent requests without blocking inference
 - **Accuracy**: Match PyTorch numerical precision
+- **Performance**: Current implementation is 1000x slower than optimised BLAS - major optimization needed
 
 ## Platform-Specific Opportunities
 
-### Apple Silicon (M-Series)
+### Apple Silicon (M-Series) ✅ **Draft Detection Implemented**
 - **Metal Performance Shaders** integration for matrix operations
 - **AMX instruction set** access for accelerated linear algebra
 - **Unified memory architecture** exploitation for zero-copy transfers
 - **Power efficiency tuning** across P and E cores
+- **✅ Proper M1/M2/M3/M4 detection** via system calls
+
+*Current status: Hardware detection working, GPU acceleration not yet implemented.*
 
 ### x86_64 Architecture
 - **AVX-512 vectorization** with masked operations
@@ -137,39 +152,29 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 
 ## Getting Started
 
-**Current Status**: This repository contains the original Python DeepSeek V3 implementation. The Zig implementation is proposed future work.
+**Current Status**: This repository contains a **DRAFT EXPERIMENTAL** Zig implementation foundation.
 
-### For the Current Python Implementation:
+### For the Current Zig Implementation:
 ```bash
 # Clone this repository
-git clone https://github.com/[current-repo-path]
-cd DeepSeek-V3-Zig
+git clone https://github.com/Triex/DeepZig-V3
+cd DeepZig-V3/experimental
 
-# Follow existing Python setup instructions
-# (see original DeepSeek V3 documentation)
-```
-
-### For the Proposed Zig Implementation:
-```bash
-# This would be the future workflow once implemented:
-
-# 1. Set up new Zig project structure
-zig init-exe deepseek-v3-zig
-
-# 2. Implement core components
-# - Tensor operations with SIMD
-# - HTTP server framework
-# - Model architecture
-
-# 3. Test and benchmark
-zig build test
+# Build and test the foundation
+zig build
+
+# Run the HTTP server (basic structure)
+zig build run -- --port 8080
+
+# Run benchmarks (see actual performance)
 zig build bench
 
-# 4. Run web server
-zig build run -- --port 8080
+# Test Apple Silicon detection
+zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
+./test_m_series
 ```
 
-**Want to contribute to making this real?** See [Seeking Contributors](#seeking-contributors) below.
+**📊 Performance Reality Check**: See [experimental/README.md](experimental/README.md) for actual benchmark results showing current performance limitations and optimisation opportunities.
 
 ## Development Approach
 
@@ -183,38 +188,27 @@ Reference: [Zig Cookbook](https://zigcc.github.io/zig-cookbook/) for implementat
 
 ## Seeking Contributors
 
-This is an ambitious project that would benefit from expertise in:
+This is an ambitious **DRAFT project** that would benefit from expertise in:
+- **Performance optimization** (current bottleneck: naive matrix operations)
 - **Zig systems programming**
 - **GPU kernel optimization** (CUDA/Metal)
 - **ML model implementation**
 - **Web server development**
-- **Performance optimization**
 - **Hardware-software co-design**
 - **Novel inference techniques** (Speculative decoding, quantization)
 
-## Project Timeline
-
-- Foundation and basic tensor ops
-- Core transformer implementation
-- Backend optimization and web API
-- Testing, benchmarking, deployment tools
-
-## Key Questions
-
-**Q: Why not just optimize PyTorch?**
-A: PyTorch's Python overhead and GC pauses are fundamental limitations. Zig offers zero-cost abstractions, superior error handling, and deterministic performance.
-
-**Q: How will this compare to llama.cpp?**
-A: Similar performance goals, but with built-in web API, better memory management, and focus on DeepSeek V3's specific MoE architecture.
-
-**Q: What about ONNX/TensorRT/ZML etc?**
-A: Those are inference runtimes, not development frameworks / LLM frameworks. This project enables rapid iteration and custom optimization for research.
-
----
+## Current Limitations & Next Steps
+
+**🚧 What's Working**: Compiles, runs, measures performance
+**⚠️ What's Missing**: Optimized algorithms, robust flows, actual DeepSeek V3 model
+**📊 Performance Gap**: 1000x slower than production systems
+**🎯 Next Priority**: BLAS integration and GPU acceleration
+
+See [experimental implementation](experimental/) for technical details and current benchmarks.
 
 ## References
 
-- [DeepZig V3 (Experimental Start)](https://github.com/Triex/DeepZig-V3/tree/main/experimental)
+- [DeepZig V3 (Experimental Implementation)](experimental/) - **Current working code**
 - [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437) - Original model architecture
 - [Zig Language](https://ziglang.org/) - Language documentation
 - [Awesome Zig](https://github.com/C-BJ/awesome-zig) - Community resources
@@ -225,5 +219,7 @@ A: Those are inference runtimes, not development frameworks / LLM frameworks. Th
 
 ---
 
-**Status**: 🎯 Seeking feedback & idea expansion<br/>
+**Status**: 🎯 **EXPERIMENTAL DRAFT** - Foundation compiles and runs basic operations ([see benchmarks](experimental/README.md#benchmarks))<br/>
 **Vision**: Foundation for advanced AI reasoning research
+
+**⚠️ Important**: This is a **research/development foundation** with draft/base implementations. Not ready for production use.
experimental/README.md

@@ -9,6 +9,7 @@ A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/)
 > - ✅ **SIMD-optimized tensor operations** (AVX2, NEON)
 > - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
 > - ✅ **Memory management** and backend architecture
+> - ✅ **Apple Silicon detection via sysctl calls**
 >
 > **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.<br/>
 > **Performance Note**: Current implementation uses naive algorithms - matrix multiplication is ~1000x slower than optimized BLAS. See [benchmarks](#benchmarks) below.<br/>
@@ -25,6 +26,8 @@ This experimental implementation aims to leverage Zig's unique advantages for sy
 - **Single binary deployment** with no runtime dependencies
 - **Cross-platform compilation** for multiple architectures
 
+**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.
+
 ## Project Structure
 
 ```
@@ -230,20 +233,26 @@ Run benchmarks to measure performance:
 zig build bench
 ```
 
+**Hardware Context**: Benchmarks run on Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.
+
 Example output:
 ```
 🚀 DeepZig V3 Performance Benchmarks
 ==========================================
 
 Backend: CPU (SIMD optimized)
-Architecture: x86_64
-Thread count: 16
+Architecture: aarch64
+Thread count: 8
+Hardware: Apple M1 MacBook Pro, 16GB unified memory
 
 Operation                      | Iterations | Avg Time   | Operations/s     | Memory
 -------------------------------|------------|------------|------------------|--------
 Tensor Creation (1024x1024)    |  1000 iter |    2.03 ms |       493 ops/s  |   4.0 MB
 Tensor Addition (SIMD)         |   100 iter |    1.49 ms | 2806962690 ops/s |  48.0 MB
 Matrix Multiplication          |    10 iter | 6418.08 ms |        0 GFLOPS  |  12.0 MB
+SwiGLU Activation              |  1000 iter |    4.44 ms | 236002478 ops/s  |  12.0 MB
+RMS Normalization (SIMD)       |  1000 iter |    0.00 ms |   1077586 ops/s  |   0.0 MB
+Memory Bandwidth               |   100 iter |    4.92 ms |        13 ops/s  | 128.0 MB
 ```
 
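For scale on the matmul row: a 1024×1024 multiply is 2·1024³ ≈ 2.15 GFLOPs of work, so 6418 ms comes out to roughly 0.33 GFLOPS — the table shows 0 GFLOPS only because that column rounds to whole units.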
 ## Known Issues
experimental/src/backends/metal/device.zig (new file, 306 lines)

@@ -0,0 +1,306 @@
// Metal Device detection and handling for Apple Silicon
// Specifically optimized for M-series chips using proper system detection

const std = @import("std");
const Allocator = std.mem.Allocator;
const c = std.c;

// Device information structure
pub const MetalDeviceInfo = struct {
    device_name: []const u8,
    is_apple_silicon: bool,
    is_m_series: bool,
    series_generation: u8, // 1 = M1, 2 = M2, 3 = M3, etc.
    variant: []const u8, // "Pro", "Max", "Ultra", etc.
    unified_memory_size: u64,
    has_anc: bool, // Apple Neural Engine

    pub fn format(
        self: @This(),
        comptime fmt: []const u8,
        options: std.fmt.FormatOptions,
        writer: anytype,
    ) !void {
        _ = fmt;
        _ = options;
        try writer.print("Metal Device: {s} ({s}{d} {s})", .{
            self.device_name,
            if (self.is_m_series) "M" else "",
            if (self.is_m_series) self.series_generation else 0,
            if (self.is_m_series) self.variant else "",
        });
        try writer.print("\nUnified Memory: {} GB", .{self.unified_memory_size / (1024 * 1024 * 1024)});
        try writer.print("\nApple Neural Engine: {s}", .{if (self.has_anc) "Available" else "Not Available"});
    }
};

// M-series chip information
const MSeriesInfo = struct {
    is_m_series: bool,
    generation: u8,
    variant: []const u8,
};

// System detection using sysctl
const SysctlError = error{
    NotFound,
    BufferTooSmall,
    SystemError,
};

/// Get sysctl string value
fn getSysctlString(allocator: Allocator, name: []const u8) ![]const u8 {
    // Only available on macOS
    if (@import("builtin").os.tag != .macos) {
        return SysctlError.NotFound;
    }

    var size: usize = 0;

    // First, get the size needed
    const name_cstr = try allocator.dupeZ(u8, name);
    defer allocator.free(name_cstr);

    if (c.sysctlbyname(name_cstr.ptr, null, &size, null, 0) != 0) {
        return SysctlError.NotFound;
    }

    // Allocate buffer and get the actual value
    const buffer = try allocator.alloc(u8, size);
    defer allocator.free(buffer);

    if (c.sysctlbyname(name_cstr.ptr, buffer.ptr, &size, null, 0) != 0) {
        return SysctlError.SystemError;
    }

    // Return a copy of the string (minus null terminator if present)
    const len = if (size > 0 and buffer[size - 1] == 0) size - 1 else size;
    return try allocator.dupe(u8, buffer[0..len]);
}

/// Get sysctl integer value
fn getSysctlInt(comptime T: type, name: []const u8, allocator: Allocator) !T {
    if (@import("builtin").os.tag != .macos) {
        return SysctlError.NotFound;
    }

    var value: T = 0;
    var size: usize = @sizeOf(T);

    const name_cstr = try allocator.dupeZ(u8, name);
    defer allocator.free(name_cstr);

    if (c.sysctlbyname(name_cstr.ptr, &value, &size, null, 0) != 0) {
        return SysctlError.NotFound;
    }

    return value;
}

/// Check if running under Rosetta 2 translation
fn isRunningUnderRosetta(allocator: Allocator) bool {
    const result = getSysctlInt(i32, "sysctl.proc_translated", allocator) catch return false;
    return result == 1;
}

/// Check if hardware supports ARM64 (Apple Silicon)
fn isAppleSiliconHardware(allocator: Allocator) bool {
    // Check for ARM64 support; treat a failed sysctl as 0 so the
    // fallback checks below still run
    const arm64_support = getSysctlInt(i32, "hw.optional.arm64", allocator) catch 0;
    if (arm64_support == 1) return true;

    // Alternative check: CPU architecture
    if (@import("builtin").target.cpu.arch == .aarch64) return true;

    // If running under Rosetta, we're on Apple Silicon
    return isRunningUnderRosetta(allocator);
}

/// Parse M-series information from CPU brand string
fn parseMSeriesInfo(cpu_brand: []const u8) MSeriesInfo {
    // Default values
    var result = MSeriesInfo{
        .is_m_series = false,
        .generation = 0,
        .variant = "",
    };

    // Look for Apple M pattern
    if (std.mem.indexOf(u8, cpu_brand, "Apple M") == null) {
        return result;
    }

    result.is_m_series = true;

    // Extract generation and variant from the CPU brand string.
    // Examples: "Apple M1", "Apple M1 Pro", "Apple M1 Max", "Apple M1 Ultra"
    const generations = [_]struct { tag: []const u8, gen: u8 }{
        .{ .tag = "M1", .gen = 1 },
        .{ .tag = "M2", .gen = 2 },
        .{ .tag = "M3", .gen = 3 },
        .{ .tag = "M4", .gen = 4 },
    };
    for (generations) |g| {
        if (std.mem.indexOf(u8, cpu_brand, g.tag) != null) {
            result.generation = g.gen;
            break;
        }
    }

    // A plain "Apple M1" (etc.) with no suffix is the base chip: variant stays ""
    if (result.generation != 0) {
        if (std.mem.indexOf(u8, cpu_brand, " Pro") != null) {
            result.variant = "Pro";
        } else if (std.mem.indexOf(u8, cpu_brand, " Max") != null) {
            result.variant = "Max";
        } else if (std.mem.indexOf(u8, cpu_brand, " Ultra") != null) {
            result.variant = "Ultra";
        }
    }

    return result;
}

/// Try to detect GPU configuration for more detailed chip identification
fn detectGPUCores(allocator: Allocator) u32 {
    // Try to get GPU core count - this can help distinguish variants
    // Regular M1: 7-8 GPU cores
    // M1 Pro: 14-16 GPU cores
    // M1 Max: 24-32 GPU cores

    // This is a placeholder - actual implementation would query Metal API
    // For now, return 0 to indicate unknown
    _ = allocator;
    return 0;
}

/// Detect Apple Silicon and M-series chip capabilities using proper system detection
pub fn detectAppleSilicon(allocator: Allocator) !MetalDeviceInfo {
    // Check at compile-time if we're on macOS
    const is_macos = @import("builtin").os.tag == .macos;
    if (!is_macos) {
        return MetalDeviceInfo{
            .device_name = try allocator.dupe(u8, "Non-macOS Device"),
            .is_apple_silicon = false,
            .is_m_series = false,
            .series_generation = 0,
            .variant = try allocator.dupe(u8, ""),
            .unified_memory_size = 0,
            .has_anc = false,
        };
    }

    // Detect Apple Silicon hardware
    const is_apple_silicon = isAppleSiliconHardware(allocator);
    if (!is_apple_silicon) {
        return MetalDeviceInfo{
            .device_name = try allocator.dupe(u8, "Intel Mac"),
            .is_apple_silicon = false,
            .is_m_series = false,
            .series_generation = 0,
            .variant = try allocator.dupe(u8, ""),
            .unified_memory_size = 0,
            .has_anc = false,
        };
    }

    // Get CPU brand string for M-series detection - this is the authoritative source
    // (duped on fallback so the deferred free never touches a string literal)
    const cpu_brand = getSysctlString(allocator, "machdep.cpu.brand_string") catch try allocator.dupe(u8, "Apple Silicon");
    defer allocator.free(cpu_brand);

    std.log.debug("CPU Brand String: '{s}'", .{cpu_brand});

    // Parse M-series information from the actual CPU brand string
    const m_info = parseMSeriesInfo(cpu_brand);

    // Get additional hardware details for logging/debugging
    const hw_model = getSysctlString(allocator, "hw.model") catch "";
    defer if (hw_model.len > 0) allocator.free(hw_model);

    const gpu_cores = detectGPUCores(allocator);
    if (gpu_cores > 0) {
        std.log.debug("GPU Cores: {}", .{gpu_cores});
    }

    std.log.debug("Hardware Model: '{s}'", .{hw_model});
    std.log.debug("Detected M{d} {s}", .{ m_info.generation, m_info.variant });

    // Get system memory
    const memory_size = getSysctlInt(u64, "hw.memsize", allocator) catch (16 * 1024 * 1024 * 1024); // Default 16GB

    // Get device name (also duped on fallback; the caller owns and frees it)
    const device_name = getSysctlString(allocator, "hw.model") catch try allocator.dupe(u8, "Apple Silicon Mac");

    return MetalDeviceInfo{
        .device_name = device_name, // This will be owned by the caller
        .is_apple_silicon = true,
        .is_m_series = m_info.is_m_series,
        .series_generation = m_info.generation,
        .variant = try allocator.dupe(u8, m_info.variant), // Duplicate to ensure consistent allocation
        .unified_memory_size = memory_size,
        .has_anc = m_info.is_m_series, // All M-series have Apple Neural Engine
    };
}

/// Memory allocation strategy for the detected device
/// (named so other modules, e.g. the Metal backend struct, can use it as a field type)
pub const MemoryStrategy = enum { UnifiedMemory, DiscreteMemory };

/// Get optimal GPU parameters for detected device
pub fn getOptimalWorkGroupSize() u32 {
    // These are reasonable defaults that should work well on most Apple GPU architectures
    // In a real implementation, we would query Metal API for the actual optimal values
    if (@import("builtin").target.cpu.arch == .aarch64) {
        // Apple Silicon optimized values based on GPU core count
        return 128;
    }

    // Default for Intel Macs and others
    return 64;
}

/// Get recommended memory allocation strategy based on device capabilities
pub fn getMemoryStrategy() MemoryStrategy {
    // Check if we're on Apple Silicon hardware (even under Rosetta)
    if (@import("builtin").os.tag == .macos) {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        if (isAppleSiliconHardware(allocator)) {
            return .UnifiedMemory; // Apple Silicon uses unified memory
        }
    }

    // For Intel Macs and other platforms
    return .DiscreteMemory;
}

/// Get optimal tensor block size for current device
pub fn getOptimalTensorBlockSize() u32 {
    if (@import("builtin").target.cpu.arch == .aarch64) {
        // Apple Silicon has more GPU cores and benefits from larger blocks
        return 256;
    } else {
        return 128;
    }
}
experimental/src/backends/metal/memory.zig (new file, 152 lines)

@@ -0,0 +1,152 @@
// Metal-specific memory management for Apple Silicon
// Optimized for the unified memory architecture of M-series chips

const std = @import("std");
const Allocator = std.mem.Allocator;
const device = @import("device.zig");
const MetalDeviceInfo = device.MetalDeviceInfo;

/// Memory modes available for Metal buffers
pub const MetalMemoryMode = enum {
    /// Shared between CPU and GPU with automatic migration
    Shared,

    /// Managed with separate CPU and GPU views but synchronized
    Managed,

    /// GPU-only storage for maximum performance
    Private,

    /// Memory visible to both CPU and GPU (Apple Silicon only)
    Unified,
};

/// Buffer usage patterns to optimize memory allocation
pub const MetalBufferUsage = enum {
    /// Read often by GPU
    GpuRead,

    /// Write often by GPU
    GpuWrite,

    /// Read/write by both CPU and GPU
    Shared,

    /// Used only temporarily for a single operation
    Transient,
};

/// Memory manager for optimal Metal buffer allocation on M-series chips
pub const MetalMemoryManager = struct {
    allocator: Allocator,
    device_info: ?MetalDeviceInfo,
    total_allocated: usize,
    max_allocation: usize,

    const Self = @This();

    /// Create a new Metal memory manager
    pub fn init(allocator: Allocator, device_info: ?MetalDeviceInfo) Self {
        return Self{
            .allocator = allocator,
            .device_info = device_info,
            .total_allocated = 0,
            .max_allocation = 0,
        };
    }

    /// Clean up any resources
    pub fn deinit(self: *Self) void {
        // Release any cached buffers or other resources
        _ = self;
    }

    /// Get the optimal memory mode based on device capabilities and usage pattern
    pub fn getOptimalMemoryMode(self: *Self, usage: MetalBufferUsage) MetalMemoryMode {
        // If we're on Apple Silicon, we can use unified memory
        const is_apple_silicon = self.device_info != null and self.device_info.?.is_apple_silicon;

        if (is_apple_silicon) {
            return switch (usage) {
                .GpuRead => .Unified,
                .GpuWrite => .Unified,
                .Shared => .Unified,
                .Transient => .Private, // Even on unified memory, transient data is better in private
            };
        } else {
            // On Intel Macs with discrete GPU
            return switch (usage) {
                .GpuRead => .Managed,
                .GpuWrite => .Private,
                .Shared => .Managed,
                .Transient => .Private,
            };
        }
    }

    /// Get recommended allocation size (aligned to device preferences)
    pub fn getOptimalAllocationSize(self: *Self, requested_size: usize) usize {
        // M-series chips prefer certain memory alignment patterns
        const alignment: usize = if (self.device_info != null and self.device_info.?.is_m_series)
            16 * 1024 // 16KB alignment on M-series
        else
            4 * 1024; // 4KB on other devices

        return std.mem.alignForward(usize, requested_size, alignment);
    }

    /// Track memory allocations for monitoring
    pub fn trackAllocation(self: *Self, size: usize) void {
        self.total_allocated += size;
        self.max_allocation = @max(self.max_allocation, self.total_allocated);
    }

    /// Track memory deallocations
    pub fn trackDeallocation(self: *Self, size: usize) void {
        if (self.total_allocated >= size) {
            self.total_allocated -= size;
        } else {
            self.total_allocated = 0;
        }
    }

    /// Get memory usage statistics
    pub fn getMemoryStats(self: *Self) struct {
        current: usize,
        peak: usize,
        device_total: usize,
    } {
        const device_total = if (self.device_info != null)
            self.device_info.?.unified_memory_size
        else
            0;

        return .{
            .current = self.total_allocated,
            .peak = self.max_allocation,
            .device_total = device_total,
        };
    }

    /// Get recommended buffer storage mode string for Metal API
    pub fn getStorageModeString(mode: MetalMemoryMode) []const u8 {
        return switch (mode) {
            .Shared => "MTLStorageModeShared",
            .Managed => "MTLStorageModeManaged",
            .Private => "MTLStorageModePrivate",
            .Unified => "MTLStorageModeShared", // Unified uses Shared at the API level
        };
    }
};

/// Helper to determine if hazard tracking should be enabled based on device capabilities
pub fn shouldUseHazardTracking(device_info: ?MetalDeviceInfo) bool {
    if (device_info == null) return false;

    // M3 and newer have better hazard tracking hardware
    if (device_info.?.is_m_series and device_info.?.series_generation >= 3) {
        return true;
    }

    return false;
}
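A minimal usage sketch for the manager above (a hypothetical standalone call site, not code from this PR; it assumes `device.zig` and `memory.zig` sit next to each other as in this tree):

```zig
const std = @import("std");
const device = @import("device.zig");
const memory = @import("memory.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const info = try device.detectAppleSilicon(allocator);
    defer {
        allocator.free(info.device_name);
        allocator.free(info.variant);
    }

    var manager = memory.MetalMemoryManager.init(allocator, info);
    defer manager.deinit();

    // Pick a storage mode for a shared CPU/GPU buffer, then round the size
    // up to the device's preferred alignment (16 KB on M-series).
    const mode = manager.getOptimalMemoryMode(.Shared);
    const size = manager.getOptimalAllocationSize(1000);
    std.log.info("mode={s} size={d}", .{ memory.MetalMemoryManager.getStorageModeString(mode), size });
}
```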
experimental/src/backends/metal/ — MetalBackend (modified)

@@ -4,12 +4,18 @@
 const std = @import("std");
 const deepseek_core = @import("deepseek_core");
 const Allocator = std.mem.Allocator;
+const metal_device = @import("device.zig");
+const MetalDeviceInfo = metal_device.MetalDeviceInfo;
 
 /// Metal backend implementation for Apple Silicon
 pub const MetalBackend = struct {
     allocator: Allocator,
     device_available: bool,
     unified_memory_size: u64,
+    device_info: ?MetalDeviceInfo,
+    optimal_work_group_size: u32,
+    memory_strategy: metal_device.MemoryStrategy, // named enum (see device.zig) so it is usable as a field type
+    tensor_block_size: u32,
 
     const Self = @This();
 
@@ -17,10 +23,37 @@ pub const MetalBackend = struct {
     // Check if Metal is available (compile-time check for macOS)
     const metal_available = @import("builtin").os.tag == .macos;
 
+    var device_info: ?MetalDeviceInfo = null;
+    var unified_memory_size: u64 = 0;
+    var optimal_work_group_size: u32 = 64; // Default
+    var tensor_block_size: u32 = 128; // Default
+
     if (metal_available) {
-        std.log.info("Metal Backend initialized on Apple Silicon");
-        // TODO: Initialize MTLDevice and command queue
-        // TODO: Query unified memory size
+        // Detect Apple Silicon and M-series capabilities
+        device_info = try metal_device.detectAppleSilicon(allocator);
+        unified_memory_size = device_info.?.unified_memory_size;
+        optimal_work_group_size = metal_device.getOptimalWorkGroupSize();
+        tensor_block_size = metal_device.getOptimalTensorBlockSize();
+
+        std.log.info("Metal Backend initialized on {s}", .{device_info.?.device_name});
+
+        // Log detailed device information
+        if (device_info.?.is_apple_silicon) {
+            if (device_info.?.is_m_series) {
+                std.log.info("Detected M{d} {s} with {d}GB unified memory", .{
+                    device_info.?.series_generation,
+                    device_info.?.variant,
+                    unified_memory_size / (1024 * 1024 * 1024),
+                });
+            } else {
+                std.log.info("Detected Apple Silicon (non-M series) with {d}GB unified memory", .{unified_memory_size / (1024 * 1024 * 1024)});
+            }
+        } else {
+            std.log.warn("Metal is available but not running on Apple Silicon", .{});
+        }
     } else {
         std.log.warn("Metal Backend not available on this platform");
     }
 
@@ -28,7 +61,11 @@ pub const MetalBackend = struct {
     return Self{
         .allocator = allocator,
         .device_available = metal_available,
-        .unified_memory_size = if (metal_available) 16 * 1024 * 1024 * 1024 else 0, // 16GB default
+        .unified_memory_size = unified_memory_size,
+        .device_info = device_info,
+        .optimal_work_group_size = optimal_work_group_size,
+        .memory_strategy = metal_device.getMemoryStrategy(),
+        .tensor_block_size = tensor_block_size,
     };
 }
 
@@ -54,14 +91,93 @@ pub const MetalBackend = struct {
         c.shape.dims[0], c.shape.dims[1]
     });
 
+    // Check if we're on Apple Silicon M series for optimized path
+    if (self.device_info != null and self.device_info.?.is_m_series) {
+        std.log.debug("Using optimized M{d} {s} matrix multiplication", .{
+            self.device_info.?.series_generation,
+            self.device_info.?.variant,
+        });
+
+        // Select appropriate implementation based on M series generation
+        switch (self.device_info.?.series_generation) {
+            3 => return try self.matmulM3(a, b, c), // M3 optimized path
+            2 => return try self.matmulM2(a, b, c), // M2 optimized path
+            1 => return try self.matmulM1(a, b, c), // M1 optimized path
+            else => {}, // Fall through to generic implementation
+        }
+    }
+
     // TODO: Implement actual Metal compute shader
     // This would involve:
     // 1. Create MTLBuffer from tensor data
     // 2. Set up compute pipeline with matmul shader
-    // 3. Dispatch compute commands
+    // 3. Dispatch compute commands with optimized workgroup size based on device
    // 4. Copy results back to tensor
 
     // For now, fallback to CPU implementation
+    std.log.warn("Falling back to CPU implementation, Metal not implemented", .{});
+    return error.NotImplemented;
+}
+
+/// M1-optimized matrix multiplication
+fn matmulM1(
+    self: *Self,
+    a: *deepseek_core.Tensor,
+    b: *const deepseek_core.Tensor,
+    c: *deepseek_core.Tensor,
+) !void {
+    _ = self;
+    _ = a;
+    _ = b;
+    _ = c;
+
+    // TODO: M1-specific optimizations
+    // - Use MPSMatrixMultiplication with M1-specific parameters
+    // - Optimize for 7/8 GPU cores typically found in M1
+    // - Account for unified memory bandwidth on M1
+
+    return error.NotImplemented;
+}
+
+/// M2-optimized matrix multiplication
+fn matmulM2(
+    self: *Self,
+    a: *deepseek_core.Tensor,
+    b: *const deepseek_core.Tensor,
+    c: *deepseek_core.Tensor,
+) !void {
+    _ = self;
+    _ = a;
+    _ = b;
+    _ = c;
+
+    // TODO: M2-specific optimizations
+    // - Use MPSMatrixMultiplication with M2-specific parameters
+    // - Optimize for 8/10 GPU cores typically found in M2
+    // - Account for increased memory bandwidth on M2
+
+    return error.NotImplemented;
+}
+
+/// M3-optimized matrix multiplication
+fn matmulM3(
+    self: *Self,
+    a: *deepseek_core.Tensor,
+    b: *const deepseek_core.Tensor,
+    c: *deepseek_core.Tensor,
+) !void {
+    _ = self;
+    _ = a;
+    _ = b;
+    _ = c;
+
+    // TODO: M3-specific optimizations
+    // - Use MPSMatrixMultiplication with M3-specific parameters
+    // - Optimize for 10/16 GPU cores typically found in M3
+    // - Account for dynamic core switching on M3
+
     return error.NotImplemented;
 }
 
@@ -77,16 +193,59 @@ pub const MetalBackend = struct {
         return error.MetalNotAvailable;
     }
 
-    _ = input;
+    std.log.debug("Metal RMS normalization with {} elements", .{input.len});
+
+    // Check if we're on Apple Silicon M series for optimized path
+    if (self.device_info != null and self.device_info.?.is_m_series) {
+        std.log.debug("Using optimized M{d} {s} RMS normalization", .{
+            self.device_info.?.series_generation,
+            self.device_info.?.variant,
+        });
+
+        // Select optimal workgroup size based on M series generation
+        const workgroup_size: usize = switch (self.device_info.?.series_generation) {
+            3 => 256, // M3 has more GPU cores
+            2 => 192, // M2 optimization
+            else => 128, // M1 and others
+        };
+
+        // Determine if we should use unified memory approach
+        const use_unified_memory = self.memory_strategy == .UnifiedMemory;
+
+        // Calculate optimal thread count based on input size and GPU cores
+        const thread_count = @min(
+            std.mem.alignForward(usize, input.len, workgroup_size),
+            workgroup_size * 1024, // Maximum reasonable thread count
+        );
+
+        std.log.debug("RMS Norm using workgroup size: {}, threads: {}", .{ workgroup_size, thread_count });
+
+        // TODO: Implement Metal compute shader for RMS norm with M-series optimizations
+        // 1. Create buffers (potentially using managed storage mode for unified memory)
+        // 2. Set up compute pipeline with RMS norm shader
+        // 3. Dispatch compute with optimal work group size
+        // 4. Handle results with zero-copy when possible on unified memory
+
+        if (!use_unified_memory) {
+            // Would handle non-unified memory path differently
+            std.log.debug("Using discrete memory path", .{});
+        }
+    }
+
+    // TODO: Complete implementation of Metal compute shader for RMS norm
+    // Metal excels at parallel operations like normalization
+
     _ = weight;
     _ = output;
     _ = eps;
-
-    std.log.debug("Metal RMS normalization");
-
-    // TODO: Implement Metal compute shader for RMS norm
-    // Metal excels at parallel operations like normalization
-
     return error.NotImplemented;
 }
 
experimental/src/backends/metal/shader.zig (new file, 254 lines)

@@ -0,0 +1,254 @@
// Metal shader utility for managing and optimizing Metal shaders
// With specific optimizations for M-series Apple Silicon

const std = @import("std");
const Allocator = std.mem.Allocator;
const device = @import("device.zig");
const MetalDeviceInfo = device.MetalDeviceInfo;

/// Optimization level for Metal shaders
pub const ShaderOptimizationLevel = enum {
    none,
    default,
    performance,
    size,

    /// Get the recommended optimization level based on device capabilities
    pub fn fromDeviceInfo(device_info: ?MetalDeviceInfo) ShaderOptimizationLevel {
        if (device_info == null) return .default;

        if (device_info.?.is_m_series) {
            // M3 can handle highly optimized shaders
            if (device_info.?.series_generation >= 3) {
                return .performance;
            }
            // M1/M2 balance between performance and size
            return .default;
        }

        // For non-Apple Silicon, be more conservative
        return .default;
    }
};

/// Metal shader types
pub const ShaderType = enum {
    compute,
    vertex,
    fragment,

    pub fn toMTLFunctionType(self: ShaderType) []const u8 {
        return switch (self) {
            .compute => "MTLFunctionTypeKernel",
            .vertex => "MTLFunctionTypeVertex",
            .fragment => "MTLFunctionTypeFragment",
        };
    }
};

/// Metal shader source with metadata
pub const ShaderSource = struct {
    name: []const u8,
    source_code: []const u8,
    shader_type: ShaderType,

    /// Create a shader source with a given name and code
    pub fn init(name: []const u8, source_code: []const u8, shader_type: ShaderType) ShaderSource {
        return .{
            .name = name,
            .source_code = source_code,
            .shader_type = shader_type,
        };
    }
};

/// Metal shader compilation options including M-series specific optimizations
pub const ShaderCompileOptions = struct {
    optimization_level: ShaderOptimizationLevel,
    fast_math: bool,
    preserve_invariance: bool,

    /// Create default options for a specific device
    pub fn forDevice(device_info: ?MetalDeviceInfo) ShaderCompileOptions {
        const opt_level = ShaderOptimizationLevel.fromDeviceInfo(device_info);

        // M-series chips benefit from fast math but some algorithms require precision
        const fast_math = device_info != null and
            device_info.?.is_m_series and
            device_info.?.series_generation >= 2;

        return .{
            .optimization_level = opt_level,
            .fast_math = fast_math,
            .preserve_invariance = false,
        };
    }
};

/// Utility for managing Metal shader compilation and caching
pub const ShaderManager = struct {
    allocator: Allocator,
    device_info: ?MetalDeviceInfo,
    compile_options: ShaderCompileOptions,

    const Self = @This();

    /// Create a new shader manager
    pub fn init(
        allocator: Allocator,
        device_info: ?MetalDeviceInfo,
    ) Self {
        return Self{
            .allocator = allocator,
            .device_info = device_info,
            .compile_options = ShaderCompileOptions.forDevice(device_info),
        };
    }

    /// Clean up resources
    pub fn deinit(self: *Self) void {
        _ = self;
    }

    /// Get optimal threadgroup size for a compute shader on current device
    pub fn getOptimalThreadgroupSize(self: *Self) struct { x: u32, y: u32, z: u32 } {
        if (self.device_info == null or !self.device_info.?.is_apple_silicon) {
            return .{ .x = 8, .y = 8, .z = 1 };
        }

        // M-series chips have different optimal sizes
        if (self.device_info.?.is_m_series) {
            return switch (self.device_info.?.series_generation) {
                3 => .{ .x = 16, .y = 16, .z = 1 }, // M3 has more GPU cores
                2 => .{ .x = 16, .y = 8, .z = 1 }, // M2
                else => .{ .x = 8, .y = 8, .z = 1 }, // M1
            };
        }

        return .{ .x = 8, .y = 8, .z = 1 };
    }

    /// Get memory barrier type based on hardware capabilities
    pub fn getOptimalBarrierType(self: *Self) []const u8 {
        // Newer M-series chips support more efficient memory barriers
        if (self.device_info != null and
            self.device_info.?.is_m_series and
            self.device_info.?.series_generation >= 2)
        {
            return "MTLBarrierScopeBuffers";
        }

        return "MTLBarrierScopeTextures | MTLBarrierScopeBuffers";
    }

    /// Generate compilation options string for Metal API
    pub fn getCompileOptionsString(self: *Self) []const u8 {
        _ = self;
        // In a real implementation, this would return Objective-C code to set up
        // MTLCompileOptions with the appropriate parameters
        return "MTLCompileOptions"; // Placeholder
    }
};

/// Create optimized Metal shaders for key operations based on device capabilities
pub fn createOptimizedMetalShaders(device_info: ?MetalDeviceInfo) struct {
    matmul: []const u8,
    rms_norm: []const u8,
    swiglu: []const u8,
    attention: []const u8,
} {
    // Base versions of shaders
    const base_matmul_shader =
        \\#include <metal_stdlib>
        \\using namespace metal;
        \\
        \\kernel void matmul_kernel(
        \\    device const float* a [[buffer(0)]],
        \\    device const float* b [[buffer(1)]],
        \\    device float* c [[buffer(2)]],
        \\    constant uint& M [[buffer(3)]],
        \\    constant uint& N [[buffer(4)]],
        \\    constant uint& K [[buffer(5)]],
        \\    uint2 gid [[thread_position_in_grid]]
        \\) {
        \\    if (gid.x >= N || gid.y >= M) return;
        \\
        \\    float sum = 0.0;
        \\    for (uint k = 0; k < K; k++) {
        \\        sum += a[gid.y * K + k] * b[k * N + gid.x];
        \\    }
        \\    c[gid.y * N + gid.x] = sum;
        \\}
    ;

    const base_rms_norm_shader =
        \\#include <metal_stdlib>
        \\using namespace metal;
        \\
        \\kernel void rms_norm_kernel(
        \\    device const float* input [[buffer(0)]],
        \\    device const float* weight [[buffer(1)]],
        \\    device float* output [[buffer(2)]],
        \\    constant uint& size [[buffer(3)]],
        \\    constant float& eps [[buffer(4)]],
        \\    uint idx [[thread_position_in_grid]]
        \\) {
        \\    if (idx >= size) return;
        \\
        \\    // Calculate sum of squares
        \\    float sum_sq = 0.0;
        \\    for (uint i = 0; i < size; i++) {
        \\        float val = input[i];
        \\        sum_sq += val * val;
        \\    }
        \\
        \\    // RMS normalization
        \\    float rms = sqrt(sum_sq / size + eps);
        \\    output[idx] = input[idx] / rms * weight[idx];
        \\}
    ;

    // Default implementations (matmul is the only one swapped below,
    // so it alone needs to be a mutable []const u8 slice)
    var matmul: []const u8 = base_matmul_shader;
    const rms_norm: []const u8 = base_rms_norm_shader;
    const swiglu: []const u8 = ""; // Placeholder
    const attention: []const u8 = ""; // Placeholder

    // For M-series chips, we can use optimized implementations
    if (device_info != null and device_info.?.is_m_series) {
        // M3 optimizations
        if (device_info.?.series_generation >= 3) {
            // M3 has improved threadgroup memory, use tiled implementation
            matmul =
                \\#include <metal_stdlib>
                \\using namespace metal;
                \\
                \\kernel void matmul_kernel_optimized_m3(
                \\    device const float* a [[buffer(0)]],
                \\    device const float* b [[buffer(1)]],
                \\    device float* c [[buffer(2)]],
                \\    constant uint& M [[buffer(3)]],
                \\    constant uint& N [[buffer(4)]],
                \\    constant uint& K [[buffer(5)]],
                \\    uint2 gid [[thread_position_in_grid]],
                \\    uint2 tid [[thread_position_in_threadgroup]],
                \\    uint2 tgid [[threadgroup_position_in_grid]]
                \\) {
                \\    // Advanced implementation with tiling and local memory
                \\    // Optimized for M3 architecture
                \\    // ...
                \\}
            ;

            // Similar optimizations for other kernels...
        }
    }

    return .{
        .matmul = matmul,
        .rms_norm = rms_norm,
        .swiglu = swiglu,
        .attention = attention,
    };
}
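A small usage sketch showing how the shader utilities compose (a hypothetical call site, not code from this PR; it assumes the same sibling-file layout as above):

```zig
const std = @import("std");
const device = @import("device.zig");
const shader = @import("shader.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const info = try device.detectAppleSilicon(allocator);
    defer {
        allocator.free(info.device_name);
        allocator.free(info.variant);
    }

    var manager = shader.ShaderManager.init(allocator, info);
    defer manager.deinit();

    // Threadgroup size adapts to the detected generation (e.g. 16x16 on M3).
    const tg = manager.getOptimalThreadgroupSize();
    std.log.info("threadgroup {d}x{d}x{d}", .{ tg.x, tg.y, tg.z });

    // Pick device-tuned Metal source for the matmul kernel.
    const shaders = shader.createOptimizedMetalShaders(info);
    std.log.info("matmul kernel is {d} bytes of MSL", .{shaders.matmul.len});
}
```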
|
39
experimental/src/test_m_series.zig
Normal file
39
experimental/src/test_m_series.zig
Normal file
@ -0,0 +1,39 @@
|
|||||||
|
// Test program for M series detection
|
||||||
|
const std = @import("std");
|
||||||
|
const metal_device = @import("backends/metal/device.zig");
|
||||||
|
|
||||||
|
pub fn main() !void {
|
||||||
|
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
||||||
|
defer _ = gpa.deinit();
|
||||||
|
|
||||||
|
const allocator = gpa.allocator();
|
||||||
|
|
||||||
|
std.log.info("Testing M series detection...", .{});
|
||||||
|
|
||||||
|
// Detect Apple Silicon and M-series capabilities
|
||||||
|
const device_info = try metal_device.detectAppleSilicon(allocator);
|
||||||
|
defer {
|
||||||
|
allocator.free(device_info.device_name);
|
||||||
|
allocator.free(device_info.variant);
|
||||||
|
}
|
||||||
|
|
||||||
|
std.log.info("Device Info:", .{});
|
||||||
|
std.log.info(" Device Name: {s}", .{device_info.device_name});
|
||||||
|
std.log.info(" Is Apple Silicon: {}", .{device_info.is_apple_silicon});
|
||||||
|
std.log.info(" Is M Series: {}", .{device_info.is_m_series});
|
||||||
|
|
||||||
|
if (device_info.is_m_series) {
|
||||||
|
std.log.info(" M Series Generation: {}", .{device_info.series_generation});
|
||||||
|
std.log.info(" Variant: {s}", .{device_info.variant});
|
||||||
|
}
|
||||||
|
|
||||||
|
std.log.info(" Unified Memory: {} GB", .{device_info.unified_memory_size / (1024 * 1024 * 1024)});
|
||||||
|
std.log.info(" Has Apple Neural Engine: {}", .{device_info.has_anc});
|
||||||
|
|
||||||
|
// Test other utility functions
|
||||||
|
std.log.info("Optimal Work Group Size: {}", .{metal_device.getOptimalWorkGroupSize()});
|
||||||
|
std.log.info("Memory Strategy: {s}", .{@tagName(metal_device.getMemoryStrategy())});
|
||||||
|
std.log.info("Optimal Tensor Block Size: {}", .{metal_device.getOptimalTensorBlockSize()});
|
||||||
|
|
||||||
|
std.log.info("Test complete!", .{});
|
||||||
|
}
|
Binary file not shown.