Mirror of https://github.com/deepseek-ai/DeepSeek-V3.git (synced 2025-07-04)

Merge pull request #2 from Triex:feat-Enhanced-device-detection-handling-initial-metal

Feat-Enhanced-device-detection-handling-initial-metal

Commit 24d94f7c21
.gitignore (vendored)
@ -173,3 +173,4 @@ cython_debug/

# Zig
experimental/.zig-cache/
zig-out/
README.md
@ -16,18 +16,22 @@
|
||||
</div>
|
||||
<hr />
|
||||
|
||||
<h1 align="center"> DeepZig V3: A High-Performance LLM Architecture</h1>
|
||||
# DeepZig V3: A High-Performance LLM Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
A proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
|
||||
A **DRAFT proposal & foundation** for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
|
||||
|
||||
**Status Update**: ✅ **Foundation compiles cleanly as a theoretical implementation** with Zig 0.15.0-dev, including:
|
||||
- HTTP server with modern Zig API
|
||||
- SIMD-optimized tensor operations
|
||||
- Cross-platform backend architecture
|
||||
- Initial memory management
|
||||
- Comprehensive build system draft
|
||||
**⚠️ Status: EXPERIMENTAL DRAFT** ✅ **Foundation compiles with Zig 0.15.0-dev**, including:
|
||||
- ✅ HTTP server framework (basic structure)
|
||||
- ✅ SIMD-optimized tensor operations (draft implementation)
|
||||
- ✅ Cross-platform backend architecture
|
||||
- ✅ Initial memory management
|
||||
- ✅ **Apple Silicon M-series detection** (hardware detection via sysctl)
|
||||
- ✅ Comprehensive build system draft
|
||||
- ⚠️ **NOT PRODUCTION READY** - Draft implementation for research/development
|
||||
|
||||
**Performance Note**: Current naive algorithms are ~1000x slower than optimized BLAS. Matrix multiplication: 6418ms for 1024×1024 (see the naive matmul sketch below). This is expected for a foundational draft implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
|
||||
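For a sense of where that gap comes from, here is a minimal sketch of the kind of naive triple-loop matmul the draft currently relies on (illustrative Zig only, not the exact code in `experimental/`):

```zig
const std = @import("std");

// Naive O(n^3) matrix multiply over row-major slices.
fn naiveMatmul(a: []const f32, b: []const f32, c: []f32, n: usize) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f32 = 0.0;
            for (0..n) |k| {
                sum += a[i * n + k] * b[k * n + j]; // strided access to `b` thrashes the cache
            }
            c[i * n + j] = sum;
        }
    }
}

test "naive matmul on a 2x2 example" {
    const a = [_]f32{ 1, 2, 3, 4 };
    const b = [_]f32{ 5, 6, 7, 8 };
    var c = [_]f32{ 0, 0, 0, 0 };
    naiveMatmul(&a, &b, &c, 2);
    try std.testing.expectEqual(@as(f32, 19), c[0]); // 1*5 + 2*7
}
```

Optimized BLAS kernels do the same O(n³) work but add register tiling, cache blocking and SIMD, which accounts for most of the ~1000x gap.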
|
||||
## Why This Matters
|
||||
|
||||
@ -37,6 +41,18 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
- **Complex deployment** with heavy runtimes
|
||||
- **Platform lock-in** due to dependency complexity
|
||||
|
||||
## Expected Benefits vs Current Reality
|
||||
|
||||
| Aspect | Current (PyTorch) | Target (Zig) | **Current Draft** |
|
||||
|--------|------------------|--------------|-------------------|
|
||||
| Cold start | 10-30s | **< 2s** | *Not measured* |
|
||||
| Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
|
||||
| Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
|
||||
| Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
|
||||
| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | *6418ms (naive)* |
|
||||
|
||||
*See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
|
||||
|
||||
## Why Zig?
|
||||
|
||||
**Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access<br/>
|
||||
@ -56,14 +72,14 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
└─────────────────┘ └──────────────────┘ └─────────────────┘
|
||||
```
|
||||
|
||||
## Proposed Web API
|
||||
## Draft Web API Framework
|
||||
|
||||
### Target Endpoints
|
||||
### Planned Endpoints (Basic Structure Implemented)
|
||||
- `POST /v1/chat/completions` - OpenAI-compatible chat API (see the request sketch after this list)
|
||||
- `POST /v1/completions` - Text completion
|
||||
- `GET /v1/models` - List available models
|
||||
- `GET /health` - Service health check
|
||||
- `WebSocket /ws` - Streaming inference
|
||||
- `WebSocket /ws` - Streaming inference (planned)
|
||||
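For illustration, the chat endpoint is planned to accept OpenAI-style JSON. A minimal Zig sketch of the request types and parsing follows; the field names mirror the public OpenAI schema and are an assumption here, not code from this repository:

```zig
const std = @import("std");

// Hypothetical request shape for `POST /v1/chat/completions`.
const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

const ChatCompletionRequest = struct {
    model: []const u8,
    messages: []const ChatMessage,
    stream: bool = false, // omitted fields fall back to their defaults
};

test "parse a chat completion request body" {
    const body =
        \\{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello"}]}
    ;
    const parsed = try std.json.parseFromSlice(ChatCompletionRequest, std.testing.allocator, body, .{});
    defer parsed.deinit();
    try std.testing.expectEqualStrings("deepseek-v3", parsed.value.model);
    try std.testing.expect(!parsed.value.stream);
}
```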
|
||||
### Deployment Vision
|
||||
- **Static binaries** - Single file deployment, no dependencies
|
||||
@ -72,56 +88,55 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
- **Serverless functions** - Minimal cold start with static linking
|
||||
- **WebAssembly** - Browser inference without additional runtime
|
||||
|
||||
## Implementation Plan
|
||||
## Implementation Plan Status
|
||||
|
||||
### Phase 1: Foundation ✅ **DRAFTED**
|
||||
### Phase 1: Foundation ✅ **DRAFT COMPLETE**
|
||||
- [x] Set up Zig project structure
|
||||
- [x] Implement basic tensor operations with SIMD
|
||||
- [x] Create memory management system (arena allocators)
|
||||
- [x] Build HTTP server framework
|
||||
- [x] **Apple Silicon detection via sysctl calls**
|
||||
- [x] **Updated to Zig 0.15.0-dev - compiles cleanly**
|
||||
- [x] **Benchmark suite** showing current performance
|
||||
|
||||
### Phase 2: Core Model
|
||||
*📈 Performance baseline established - see [benchmarks](experimental/README.md#benchmarks)*
|
||||
|
||||
### Phase 2: Core Model (IN PROGRESS)
|
||||
- [ ] Implement transformer layers
|
||||
- [ ] Add Multi-Head Latent Attention (MLA)
|
||||
- [ ] Build Mixture of Experts (MoE) routing
|
||||
- [ ] Create tokenizer integration
|
||||
|
||||
### Phase 3: Backends
|
||||
### Phase 3: Backends (PLANNED)
|
||||
- [ ] Optimize CPU backend with AVX/NEON
|
||||
- [ ] Integrate Metal for Apple Silicon
|
||||
- [ ] Add CUDA support for NVIDIA GPUs
|
||||
- [ ] Implement WebGPU for browsers
|
||||
|
||||
### Phase 4: Web Integration
|
||||
### Phase 4: Web Integration (DRAFT STRUCTURE)
|
||||
- [x] Complete HTTP API implementation (basic structure)
|
||||
- [ ] Add WebSocket streaming
|
||||
- [ ] Build authentication/rate limiting
|
||||
- [ ] Create deployment tooling
|
||||
|
||||
## Expected Benefits
|
||||
|
||||
| Aspect | Current (PyTorch) | Proposed (Zig) |
|
||||
|--------|------------------|----------------|
|
||||
| Cold start | 10-30s | **< 2s** |
|
||||
| Memory usage | 20-40GB | **< 16GB** |
|
||||
| Dependencies | ~2GB runtime | **Single binary** |
|
||||
| Deployment | Complex | **Copy & run** |
|
||||
|
||||
## Technical Challenges
|
||||
|
||||
- **Model Complexity**: DeepSeek V3's MoE architecture requires careful memory management
|
||||
- **Backend Integration**: Need efficient FFI to CUDA/Metal while maintaining performance
|
||||
- **Web Scale**: Handle concurrent requests without blocking inference
|
||||
- **Accuracy**: Match PyTorch numerical precision
|
||||
- **Performance**: Current implementation is 1000x slower than optimised BLAS - major optimization needed
|
||||
|
||||
## Platform-Specific Opportunities
|
||||
|
||||
### Apple Silicon (M-Series)
|
||||
### Apple Silicon (M-Series) ✅ **Draft Detection Implemented**
|
||||
- **Metal Performance Shaders** integration for matrix operations
|
||||
- **AMX instruction set** access for accelerated linear algebra
|
||||
- **Unified memory architecture** exploitation for zero-copy transfers
|
||||
- **Power efficiency tuning** across P and E cores
|
||||
- **✅ Proper M1/M2/M3/M4 detection** via system calls
|
||||
|
||||
*Current status: Hardware detection working, GPU acceleration not yet implemented.*
|
||||
|
||||
### x86_64 Architecture
|
||||
- **AVX-512 vectorization** with masked operations
|
||||
@ -137,39 +152,29 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
|
||||
## Getting Started
|
||||
|
||||
**Current Status**: This repository contains the original Python DeepSeek V3 implementation. The Zig implementation is proposed future work.
|
||||
**Current Status**: This repository contains a **DRAFT EXPERIMENTAL** Zig implementation foundation.
|
||||
|
||||
### For the Current Python Implementation:
|
||||
### For the Current Zig Implementation:
|
||||
```bash
|
||||
# Clone this repository
|
||||
git clone https://github.com/[current-repo-path]
|
||||
cd DeepSeek-V3-Zig
|
||||
git clone https://github.com/Triex/DeepZig-V3
|
||||
cd DeepZig-V3/experimental
|
||||
|
||||
# Follow existing Python setup instructions
|
||||
# (see original DeepSeek V3 documentation)
|
||||
```
|
||||
# Build and test the foundation
|
||||
zig build
|
||||
|
||||
### For the Proposed Zig Implementation:
|
||||
```bash
|
||||
# This would be the future workflow once implemented:
|
||||
# Run the HTTP server (basic structure)
|
||||
zig build run -- --port 8080
|
||||
|
||||
# 1. Set up new Zig project structure
|
||||
zig init-exe deepseek-v3-zig
|
||||
|
||||
# 2. Implement core components
|
||||
# - Tensor operations with SIMD
|
||||
# - HTTP server framework
|
||||
# - Model architecture
|
||||
|
||||
# 3. Test and benchmark
|
||||
zig build test
|
||||
# Run benchmarks (see actual performance)
|
||||
zig build bench
|
||||
|
||||
# 4. Run web server
|
||||
zig build run -- --port 8080
|
||||
# Test Apple Silicon detection
|
||||
zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
|
||||
./test_m_series
|
||||
```
|
||||
|
||||
**Want to contribute to making this real?** See [Seeking Contributors](#seeking-contributors) below.
|
||||
**📊 Performance Reality Check**: See [experimental/README.md](experimental/README.md) for actual benchmark results showing current performance limitations and optimisation opportunities.
|
||||
|
||||
## Development Approach
|
||||
|
||||
@ -183,38 +188,27 @@ Reference: [Zig Cookbook](https://zigcc.github.io/zig-cookbook/) for implementat
|
||||
|
||||
## Seeking Contributors
|
||||
|
||||
This is an ambitious project that would benefit from expertise in:
|
||||
This is an ambitious **DRAFT project** that would benefit from expertise in:
|
||||
- **Performance optimization** (current bottleneck: naive matrix operations)
|
||||
- **Zig systems programming**
|
||||
- **GPU kernel optimization** (CUDA/Metal)
|
||||
- **ML model implementation**
|
||||
- **Web server development**
|
||||
- **Performance optimization**
|
||||
- **Hardware-software co-design**
|
||||
- **Novel inference techniques** (Speculative decoding, quantization)
|
||||
|
||||
## Project Timeline
|
||||
## Current Limitations & Next Steps
|
||||
|
||||
- Foundation and basic tensor ops
|
||||
- Core transformer implementation
|
||||
- Backend optimization and web API
|
||||
- Testing, benchmarking, deployment tools
|
||||
**🚧 What's Working**: Compiles, runs, measures performance
|
||||
**⚠️ What's Missing**: Optimized algorithms, robust flows, actual DeepSeek V3 model
|
||||
**📊 Performance Gap**: 1000x slower than production systems
|
||||
**🎯 Next Priority**: BLAS integration and GPU acceleration
|
||||
|
||||
## Key Questions
|
||||
|
||||
**Q: Why not just optimize PyTorch?**
|
||||
A: PyTorch's Python overhead and GC pauses are fundamental limitations. Zig offers zero-cost abstractions, superior error handling, and deterministic performance.
|
||||
|
||||
**Q: How will this compare to llama.cpp?**
|
||||
A: Similar performance goals, but with built-in web API, better memory management, and focus on DeepSeek V3's specific MoE architecture.
|
||||
|
||||
**Q: What about ONNX/TensorRT/ZML etc?**
|
||||
A: Those are inference runtimes, not development frameworks / LLM frameworks. This project enables rapid iteration and custom optimization for research.
|
||||
|
||||
---
|
||||
See [experimental implementation](experimental/) for technical details and current benchmarks.
|
||||
|
||||
## References
|
||||
|
||||
- [DeepZig V3 (Experimental Start)](https://github.com/Triex/DeepZig-V3/tree/main/experimental)
|
||||
- [DeepZig V3 (Experimental Implementation)](experimental/) - **Current working code**
|
||||
- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437) - Original model architecture
|
||||
- [Zig Language](https://ziglang.org/) - Language documentation
|
||||
- [Awesome Zig](https://github.com/C-BJ/awesome-zig) - Community resources
|
||||
@ -225,5 +219,7 @@ A: Those are inference runtimes, not development frameworks / LLM frameworks. Th
|
||||
|
||||
---
|
||||
|
||||
**Status**: 🎯 Seeking feedback & idea expansion<br/>
|
||||
**Status**: 🎯 **EXPERIMENTAL DRAFT** - Foundation compiles and runs basic operations ([see benchmarks](experimental/README.md#benchmarks))<br/>
|
||||
**Vision**: Foundation for advanced AI reasoning research
|
||||
|
||||
**⚠️ Important**: This is a **research/development foundation** with draft/base implementations. Not ready for production use.
|
||||
|
@ -9,6 +9,7 @@ A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/)
|
||||
> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON; see the sketch after this list)
|
||||
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
|
||||
> - ✅ **Memory management** and backend architecture
|
||||
> - ✅ **Apple Silicon detection via sysctl calls**
|
||||
>
|
||||
> **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.<br/>
|
||||
> **Performance Note**: Current implementation uses naive algorithms - matrix multiplication is ~1000x slower than optimized BLAS. See [benchmarks](#benchmarks) below.<br/>
|
||||
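As a rough illustration of what the SIMD path does, here is a hand-written sketch using Zig's portable `@Vector` (not the tensor code from this commit); the compiler lowers it to AVX2 on x86_64 and NEON on aarch64:

```zig
const std = @import("std");

// Element-wise addition processed eight f32 lanes at a time, with a scalar tail.
fn addSlices(a: []const f32, b: []const f32, out: []f32) void {
    const lanes = 8; // one AVX2 register of f32; NEON would typically use 4
    var i: usize = 0;
    while (i + lanes <= a.len) : (i += lanes) {
        const va: @Vector(lanes, f32) = a[i..][0..lanes].*;
        const vb: @Vector(lanes, f32) = b[i..][0..lanes].*;
        out[i..][0..lanes].* = va + vb;
    }
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i]; // scalar tail
}

test "vectorized add matches scalar result" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    var out: [9]f32 = undefined;
    addSlices(&a, &b, &out);
    for (out) |v| try std.testing.expectEqual(@as(f32, 10), v);
}
```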
@ -25,6 +26,8 @@ This experimental implementation aims to leverage Zig's unique advantages for sy
|
||||
- **Single binary deployment** with no runtime dependencies
|
||||
- **Cross-platform compilation** for multiple architectures
|
||||
|
||||
**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
@ -230,20 +233,26 @@ Run benchmarks to measure performance:
|
||||
zig build bench
|
||||
```
|
||||
|
||||
**Hardware Context**: Benchmarks run on Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.
|
||||
|
||||
Example output:
|
||||
```
|
||||
🚀 DeepZig V3 Performance Benchmarks
|
||||
==========================================
|
||||
|
||||
Backend: CPU (SIMD optimized)
|
||||
Architecture: x86_64
|
||||
Thread count: 16
|
||||
Architecture: aarch64
|
||||
Thread count: 8
|
||||
Hardware: Apple M1 MacBook Pro, 16GB unified memory
|
||||
|
||||
Operation | Iterations | Avg Time | Operations/s | Memory
|
||||
-------------------------------|------------|-----------|--------------|-------
|
||||
Tensor Creation (1024x1024) | 1000 iter | 2.03 ms | 493 ops/s | 4.0 MB
|
||||
Tensor Addition (SIMD) | 100 iter | 1.49 ms | 2806962690 ops/s | 48.0 MB
|
||||
Matrix Multiplication | 10 iter | 6418.08 ms | 0 GFLOPS | 12.0 MB
|
||||
SwiGLU Activation | 1000 iter | 4.44 ms | 236002478 ops/s | 12.0 MB
|
||||
RMS Normalization (SIMD) | 1000 iter | 0.00 ms | 1077586 ops/s | 0.0 MB
|
||||
Memory Bandwidth | 100 iter | 4.92 ms | 13 ops/s | 128.0 MB
|
||||
```
|
||||
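For scale: a 1024×1024 matrix multiply is roughly 2·1024³ ≈ 2.1 GFLOPs of work, so 6418 ms works out to about 0.33 GFLOPS, which the whole-number GFLOPS column above reports as 0.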
|
||||
## Known Issues
|
||||
|
experimental/src/backends/metal/device.zig (new file)
@ -0,0 +1,306 @@
|
||||
// Metal Device detection and handling for Apple Silicon
|
||||
// Specifically optimized for M-series chips using proper system detection
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const c = std.c;
|
||||
|
||||
// Device information structure
|
||||
pub const MetalDeviceInfo = struct {
|
||||
device_name: []const u8,
|
||||
is_apple_silicon: bool,
|
||||
is_m_series: bool,
|
||||
series_generation: u8, // 1 = M1, 2 = M2, 3 = M3, etc.
|
||||
variant: []const u8, // "Pro", "Max", "Ultra", etc.
|
||||
unified_memory_size: u64,
|
||||
has_anc: bool, // Apple Neural Engine
|
||||
|
||||
pub fn format(
|
||||
self: @This(),
|
||||
comptime fmt: []const u8,
|
||||
options: std.fmt.FormatOptions,
|
||||
writer: anytype,
|
||||
) !void {
|
||||
_ = fmt;
|
||||
_ = options;
|
||||
try writer.print("Metal Device: {s} ({s}{d} {s})", .{
|
||||
self.device_name,
|
||||
if (self.is_m_series) "M" else "",
|
||||
if (self.is_m_series) self.series_generation else 0,
|
||||
if (self.is_m_series) self.variant else "",
|
||||
});
|
||||
try writer.print("\nUnified Memory: {} GB", .{self.unified_memory_size / (1024 * 1024 * 1024)});
|
||||
try writer.print("\nApple Neural Engine: {}", .{if (self.has_anc) "Available" else "Not Available"});
|
||||
}
|
||||
};
|
||||
|
||||
// M-series chip information
|
||||
const MSeriesInfo = struct {
|
||||
is_m_series: bool,
|
||||
generation: u8,
|
||||
variant: []const u8,
|
||||
};
|
||||
|
||||
// System detection using sysctl
|
||||
const SysctlError = error{
|
||||
NotFound,
|
||||
BufferTooSmall,
|
||||
SystemError,
|
||||
};
|
||||
|
||||
/// Get sysctl string value
|
||||
fn getSysctlString(allocator: Allocator, name: []const u8) ![]const u8 {
|
||||
// Only available on macOS
|
||||
if (@import("builtin").os.tag != .macos) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
var size: usize = 0;
|
||||
|
||||
// First, get the size needed
|
||||
const name_cstr = try allocator.dupeZ(u8, name);
|
||||
defer allocator.free(name_cstr);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, null, &size, null, 0) != 0) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
// Allocate buffer and get the actual value
|
||||
const buffer = try allocator.alloc(u8, size);
|
||||
defer allocator.free(buffer);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, buffer.ptr, &size, null, 0) != 0) {
|
||||
return SysctlError.SystemError;
|
||||
}
|
||||
|
||||
// Return a copy of the string (minus null terminator if present)
|
||||
const len = if (size > 0 and buffer[size - 1] == 0) size - 1 else size;
|
||||
return try allocator.dupe(u8, buffer[0..len]);
|
||||
}
|
||||
|
||||
/// Get sysctl integer value
|
||||
fn getSysctlInt(comptime T: type, name: []const u8, allocator: Allocator) !T {
|
||||
if (@import("builtin").os.tag != .macos) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
var value: T = 0;
|
||||
var size: usize = @sizeOf(T);
|
||||
|
||||
const name_cstr = try allocator.dupeZ(u8, name);
|
||||
defer allocator.free(name_cstr);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, &value, &size, null, 0) != 0) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
return value;
|
||||
}
|
||||
|
||||
/// Check if running under Rosetta 2 translation
|
||||
fn isRunningUnderRosetta(allocator: Allocator) bool {
|
||||
const result = getSysctlInt(i32, "sysctl.proc_translated", allocator) catch return false;
|
||||
return result == 1;
|
||||
}
|
||||
|
||||
/// Check if hardware supports ARM64 (Apple Silicon)
|
||||
fn isAppleSiliconHardware(allocator: Allocator) bool {
|
||||
// Check for ARM64 support
|
||||
const arm64_support = getSysctlInt(i32, "hw.optional.arm64", allocator) catch return false;
|
||||
if (arm64_support == 1) return true;
|
||||
|
||||
// Alternative check: CPU architecture
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) return true;
|
||||
|
||||
// If running under Rosetta, we're on Apple Silicon
|
||||
return isRunningUnderRosetta(allocator);
|
||||
}
|
||||
|
||||
/// Parse M-series information from CPU brand string
|
||||
fn parseMSeriesInfo(cpu_brand: []const u8) MSeriesInfo {
|
||||
// Default values
|
||||
var result = MSeriesInfo{
|
||||
.is_m_series = false,
|
||||
.generation = 0,
|
||||
.variant = "",
|
||||
};
|
||||
|
||||
// Look for Apple M pattern
|
||||
if (std.mem.indexOf(u8, cpu_brand, "Apple M") == null) {
|
||||
return result;
|
||||
}
|
||||
|
||||
result.is_m_series = true;
|
||||
|
||||
// Extract generation and variant from CPU brand string
|
||||
// Examples: "Apple M1", "Apple M1 Pro", "Apple M1 Max", "Apple M1 Ultra"
|
||||
if (std.mem.indexOf(u8, cpu_brand, "M1")) |_| {
|
||||
result.generation = 1;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
// Just "Apple M1" - this is the regular M1
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M2")) |_| {
|
||||
result.generation = 2;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M3")) |_| {
|
||||
result.generation = 3;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M4")) |_| {
|
||||
result.generation = 4;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
/// Try to detect GPU configuration for more detailed chip identification
|
||||
fn detectGPUCores(allocator: Allocator) u32 {
|
||||
// Try to get GPU core count - this can help distinguish variants
|
||||
// Regular M1: 7-8 GPU cores
|
||||
// M1 Pro: 14-16 GPU cores
|
||||
// M1 Max: 24-32 GPU cores
|
||||
|
||||
// This is a placeholder - actual implementation would query Metal API
|
||||
// For now, return 0 to indicate unknown
|
||||
_ = allocator;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/// Detect Apple Silicon and M-series chip capabilities using proper system detection
|
||||
pub fn detectAppleSilicon(allocator: Allocator) !MetalDeviceInfo {
|
||||
// Check at compile-time if we're on macOS
|
||||
const is_macos = @import("builtin").os.tag == .macos;
|
||||
if (!is_macos) {
|
||||
return MetalDeviceInfo{
|
||||
.device_name = try allocator.dupe(u8, "Non-macOS Device"),
|
||||
.is_apple_silicon = false,
|
||||
.is_m_series = false,
|
||||
.series_generation = 0,
|
||||
.variant = try allocator.dupe(u8, ""),
|
||||
.unified_memory_size = 0,
|
||||
.has_anc = false,
|
||||
};
|
||||
}
|
||||
|
||||
// Detect Apple Silicon hardware
|
||||
const is_apple_silicon = isAppleSiliconHardware(allocator);
|
||||
if (!is_apple_silicon) {
|
||||
return MetalDeviceInfo{
|
||||
.device_name = try allocator.dupe(u8, "Intel Mac"),
|
||||
.is_apple_silicon = false,
|
||||
.is_m_series = false,
|
||||
.series_generation = 0,
|
||||
.variant = try allocator.dupe(u8, ""),
|
||||
.unified_memory_size = 0,
|
||||
.has_anc = false,
|
||||
};
|
||||
}
|
||||
|
||||
// Get CPU brand string for M-series detection - this is the authoritative source
|
||||
const cpu_brand = getSysctlString(allocator, "machdep.cpu.brand_string") catch try allocator.dupe(u8, "Apple Silicon"); // dupe the fallback so the deferred free below is always valid
|
||||
defer allocator.free(cpu_brand);
|
||||
|
||||
std.log.debug("CPU Brand String: '{s}'", .{cpu_brand});
|
||||
|
||||
// Parse M-series information from the actual CPU brand string
|
||||
const m_info = parseMSeriesInfo(cpu_brand);
|
||||
|
||||
// Get additional hardware details for logging/debugging
|
||||
const hw_model = getSysctlString(allocator, "hw.model") catch "";
|
||||
defer if (hw_model.len > 0) allocator.free(hw_model);
|
||||
|
||||
const gpu_cores = detectGPUCores(allocator);
|
||||
if (gpu_cores > 0) {
|
||||
std.log.debug("GPU Cores: {}", .{gpu_cores});
|
||||
}
|
||||
|
||||
std.log.debug("Hardware Model: '{s}'", .{hw_model});
|
||||
std.log.debug("Detected M{d} {s}", .{ m_info.generation, m_info.variant });
|
||||
|
||||
// Get system memory
|
||||
const memory_size = getSysctlInt(u64, "hw.memsize", allocator) catch (16 * 1024 * 1024 * 1024); // Default 16GB
|
||||
|
||||
// Get device name
|
||||
const device_name = getSysctlString(allocator, "hw.model") catch try allocator.dupe(u8, "Apple Silicon Mac"); // dupe the fallback so the caller can free it unconditionally
|
||||
|
||||
return MetalDeviceInfo{
|
||||
.device_name = device_name, // This will be owned by the caller
|
||||
.is_apple_silicon = true,
|
||||
.is_m_series = m_info.is_m_series,
|
||||
.series_generation = m_info.generation,
|
||||
.variant = try allocator.dupe(u8, m_info.variant), // Duplicate to ensure consistent allocation
|
||||
.unified_memory_size = memory_size,
|
||||
.has_anc = m_info.is_m_series, // All M-series have Apple Neural Engine
|
||||
};
|
||||
}
|
||||
|
||||
/// Get optimal GPU parameters for detected device
|
||||
pub fn getOptimalWorkGroupSize() u32 {
|
||||
// These are reasonable defaults that should work well on most Apple GPU architectures
|
||||
// In a real implementation, we would query Metal API for the actual optimal values
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) {
|
||||
// Apple Silicon optimized values based on GPU core count
|
||||
return 128;
|
||||
}
|
||||
|
||||
// Default for Intel Macs and others
|
||||
return 64;
|
||||
}
|
||||
|
||||
/// Get recommended memory allocation strategy based on device capabilities
|
||||
pub fn getMemoryStrategy() enum { UnifiedMemory, DiscreteMemory } {
|
||||
// Check if we're on Apple Silicon hardware (even under Rosetta)
|
||||
if (@import("builtin").os.tag == .macos) {
|
||||
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
||||
defer _ = gpa.deinit();
|
||||
const allocator = gpa.allocator();
|
||||
|
||||
if (isAppleSiliconHardware(allocator)) {
|
||||
return .UnifiedMemory; // Apple Silicon uses unified memory
|
||||
}
|
||||
}
|
||||
|
||||
// For Intel Macs and other platforms
|
||||
return .DiscreteMemory;
|
||||
}
|
||||
|
||||
/// Get optimal tensor block size for current device
|
||||
pub fn getOptimalTensorBlockSize() u32 {
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) {
|
||||
// Apple Silicon has more GPU cores and benefits from larger blocks
|
||||
return 256;
|
||||
} else {
|
||||
return 128;
|
||||
}
|
||||
}
|
experimental/src/backends/metal/memory.zig (new file)
@ -0,0 +1,152 @@
|
||||
// Metal-specific memory management for Apple Silicon
|
||||
// Optimized for the unified memory architecture of M-series chips
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const device = @import("device.zig");
|
||||
const MetalDeviceInfo = device.MetalDeviceInfo;
|
||||
|
||||
/// Memory modes available for Metal buffers
|
||||
pub const MetalMemoryMode = enum {
|
||||
/// Shared between CPU and GPU with automatic migration
|
||||
Shared,
|
||||
|
||||
/// Managed with separate CPU and GPU views but synchronized
|
||||
Managed,
|
||||
|
||||
/// GPU-only storage for maximum performance
|
||||
Private,
|
||||
|
||||
/// Memory visible to both CPU and GPU (Apple Silicon only)
|
||||
Unified,
|
||||
};
|
||||
|
||||
/// Buffer usage patterns to optimize memory allocation
|
||||
pub const MetalBufferUsage = enum {
|
||||
/// Read often by GPU
|
||||
GpuRead,
|
||||
|
||||
/// Write often by GPU
|
||||
GpuWrite,
|
||||
|
||||
/// Read/write by both CPU and GPU
|
||||
Shared,
|
||||
|
||||
/// Used only temporarily for a single operation
|
||||
Transient,
|
||||
};
|
||||
|
||||
/// Memory manager for optimal Metal buffer allocation on M-series chips
|
||||
pub const MetalMemoryManager = struct {
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
total_allocated: usize,
|
||||
max_allocation: usize,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
/// Create a new Metal memory manager
|
||||
pub fn init(allocator: Allocator, device_info: ?MetalDeviceInfo) Self {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_info = device_info,
|
||||
.total_allocated = 0,
|
||||
.max_allocation = 0,
|
||||
};
|
||||
}
|
||||
|
||||
/// Clean up any resources
|
||||
pub fn deinit(self: *Self) void {
|
||||
// Release any cached buffers or other resources
|
||||
_ = self;
|
||||
}
|
||||
|
||||
/// Get the optimal memory mode based on device capabilities and usage pattern
|
||||
pub fn getOptimalMemoryMode(self: *Self, usage: MetalBufferUsage) MetalMemoryMode {
|
||||
// If we're on Apple Silicon, we can use unified memory
|
||||
const is_apple_silicon = self.device_info != null and self.device_info.?.is_apple_silicon;
|
||||
|
||||
if (is_apple_silicon) {
|
||||
return switch (usage) {
|
||||
.GpuRead => .Unified,
|
||||
.GpuWrite => .Unified,
|
||||
.Shared => .Unified,
|
||||
.Transient => .Private, // Even on unified memory, transient data is better in private
|
||||
};
|
||||
} else {
|
||||
// On Intel Macs with discrete GPU
|
||||
return switch (usage) {
|
||||
.GpuRead => .Managed,
|
||||
.GpuWrite => .Private,
|
||||
.Shared => .Managed,
|
||||
.Transient => .Private,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
/// Get recommended allocation size (aligned to device preferences)
|
||||
pub fn getOptimalAllocationSize(self: *Self, requested_size: usize) usize {
|
||||
// M-series chips prefer certain memory alignment patterns
|
||||
const alignment: usize = if (self.device_info != null and self.device_info.?.is_m_series)
|
||||
16 * 1024 // 16KB alignment on M-series
|
||||
else
|
||||
4 * 1024; // 4KB on other devices
|
||||
|
||||
return std.mem.alignForward(usize, requested_size, alignment);
|
||||
}
|
||||
|
||||
/// Track memory allocations for monitoring
|
||||
pub fn trackAllocation(self: *Self, size: usize) void {
|
||||
self.total_allocated += size;
|
||||
self.max_allocation = @max(self.max_allocation, self.total_allocated);
|
||||
}
|
||||
|
||||
/// Track memory deallocations
|
||||
pub fn trackDeallocation(self: *Self, size: usize) void {
|
||||
if (self.total_allocated >= size) {
|
||||
self.total_allocated -= size;
|
||||
} else {
|
||||
self.total_allocated = 0;
|
||||
}
|
||||
}
|
||||
|
||||
/// Get memory usage statistics
|
||||
pub fn getMemoryStats(self: *Self) struct {
|
||||
current: usize,
|
||||
peak: usize,
|
||||
device_total: usize,
|
||||
} {
|
||||
const device_total = if (self.device_info != null)
|
||||
self.device_info.?.unified_memory_size
|
||||
else
|
||||
0;
|
||||
|
||||
return .{
|
||||
.current = self.total_allocated,
|
||||
.peak = self.max_allocation,
|
||||
.device_total = device_total,
|
||||
};
|
||||
}
|
||||
|
||||
/// Get recommended buffer storage mode string for Metal API
|
||||
pub fn getStorageModeString(mode: MetalMemoryMode) []const u8 {
|
||||
return switch (mode) {
|
||||
.Shared => "MTLStorageModeShared",
|
||||
.Managed => "MTLStorageModeManaged",
|
||||
.Private => "MTLStorageModePrivate",
|
||||
.Unified => "MTLStorageModeShared", // Unified uses Shared on the API level
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Helper to determine if hazard tracking should be enabled based on device capabilities
|
||||
pub fn shouldUseHazardTracking(device_info: ?MetalDeviceInfo) bool {
|
||||
if (device_info == null) return false;
|
||||
|
||||
// M3 and newer have better hazard tracking hardware
|
||||
if (device_info.?.is_m_series and device_info.?.series_generation >= 3) {
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
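For orientation, a hypothetical caller of the memory manager above might look like the following (not part of this commit; the `@import` path assumes a file alongside `memory.zig`):

```zig
const std = @import("std");
const metal_memory = @import("memory.zig");

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // Passing null device info exercises the conservative non-Apple-Silicon paths.
    var manager = metal_memory.MetalMemoryManager.init(gpa.allocator(), null);
    defer manager.deinit();

    const mode = manager.getOptimalMemoryMode(.GpuRead); // .Managed when no device info is available
    const size = manager.getOptimalAllocationSize(1000); // rounded up to the 4 KB fallback alignment
    std.debug.print("mode={s} size={d}\n", .{ @tagName(mode), size });
}
```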
@ -4,12 +4,18 @@
|
||||
const std = @import("std");
|
||||
const deepseek_core = @import("deepseek_core");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const metal_device = @import("device.zig");
|
||||
const MetalDeviceInfo = metal_device.MetalDeviceInfo;
|
||||
|
||||
/// Metal backend implementation for Apple Silicon
|
||||
pub const MetalBackend = struct {
|
||||
allocator: Allocator,
|
||||
device_available: bool,
|
||||
unified_memory_size: u64,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
optimal_work_group_size: u32,
|
||||
memory_strategy: @TypeOf(metal_device.getMemoryStrategy()), // field type taken from the helper's return type
|
||||
tensor_block_size: u32,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
@ -17,10 +23,37 @@ pub const MetalBackend = struct {
|
||||
// Check if Metal is available (compile-time check for macOS)
|
||||
const metal_available = @import("builtin").os.tag == .macos;
|
||||
|
||||
var device_info: ?MetalDeviceInfo = null;
|
||||
var unified_memory_size: u64 = 0;
|
||||
var optimal_work_group_size: u32 = 64; // Default
|
||||
var tensor_block_size: u32 = 128; // Default
|
||||
|
||||
if (metal_available) {
|
||||
std.log.info("Metal Backend initialized on Apple Silicon");
|
||||
// TODO: Initialize MTLDevice and command queue
|
||||
// TODO: Query unified memory size
|
||||
// Detect Apple Silicon and M-series capabilities
|
||||
device_info = try metal_device.detectAppleSilicon(allocator);
|
||||
unified_memory_size = device_info.?.unified_memory_size;
|
||||
optimal_work_group_size = metal_device.getOptimalWorkGroupSize();
|
||||
tensor_block_size = metal_device.getOptimalTensorBlockSize();
|
||||
|
||||
std.log.info("Metal Backend initialized on {s}", .{device_info.?.device_name});
|
||||
// Log detailed device information
|
||||
if (device_info.?.is_apple_silicon) {
|
||||
if (device_info.?.is_m_series) {
|
||||
std.log.info("Detected M{d} {s} with {d}GB unified memory",
|
||||
.{
|
||||
device_info.?.series_generation,
|
||||
device_info.?.variant,
|
||||
unified_memory_size / (1024 * 1024 * 1024),
|
||||
}
|
||||
);
|
||||
} else {
|
||||
std.log.info("Detected Apple Silicon (non-M series) with {d}GB unified memory",
|
||||
.{unified_memory_size / (1024 * 1024 * 1024)}
|
||||
);
|
||||
}
|
||||
} else {
|
||||
std.log.warn("Metal is available but not running on Apple Silicon");
|
||||
}
|
||||
} else {
|
||||
std.log.warn("Metal Backend not available on this platform");
|
||||
}
|
||||
@ -28,7 +61,11 @@ pub const MetalBackend = struct {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_available = metal_available,
|
||||
.unified_memory_size = if (metal_available) 16 * 1024 * 1024 * 1024 else 0, // 16GB default
|
||||
.unified_memory_size = unified_memory_size,
|
||||
.device_info = device_info,
|
||||
.optimal_work_group_size = optimal_work_group_size,
|
||||
.memory_strategy = metal_device.getMemoryStrategy(),
|
||||
.tensor_block_size = tensor_block_size,
|
||||
};
|
||||
}
|
||||
|
||||
@ -54,14 +91,93 @@ pub const MetalBackend = struct {
|
||||
c.shape.dims[0], c.shape.dims[1]
|
||||
});
|
||||
|
||||
// Check if we're on Apple Silicon M series for optimized path
|
||||
if (self.device_info != null and self.device_info.?.is_m_series) {
|
||||
std.log.debug("Using optimized M{d} {s} matrix multiplication",
|
||||
.{
|
||||
self.device_info.?.series_generation,
|
||||
self.device_info.?.variant
|
||||
}
|
||||
);
|
||||
|
||||
// Select appropriate implementation based on M series generation
|
||||
switch (self.device_info.?.series_generation) {
|
||||
3 => return try self.matmulM3(a, b, c), // M3 optimized path
|
||||
2 => return try self.matmulM2(a, b, c), // M2 optimized path
|
||||
1 => return try self.matmulM1(a, b, c), // M1 optimized path
|
||||
else => {} // Fall through to generic implementation
|
||||
}
|
||||
}
|
||||
|
||||
// TODO: Implement actual Metal compute shader
|
||||
// This would involve:
|
||||
// 1. Create MTLBuffer from tensor data
|
||||
// 2. Set up compute pipeline with matmul shader
|
||||
// 3. Dispatch compute commands
|
||||
// 3. Dispatch compute commands with optimized workgroup size based on device
|
||||
// 4. Copy results back to tensor
|
||||
|
||||
// For now, fallback to CPU implementation
|
||||
std.log.warn("Falling back to CPU implementation, Metal not implemented");
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M1-optimized matrix multiplication
|
||||
fn matmulM1(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M1-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M1-specific parameters
|
||||
// - Optimize for 7/8 GPU cores typically found in M1
|
||||
// - Account for unified memory bandwidth on M1
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M2-optimized matrix multiplication
|
||||
fn matmulM2(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M2-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M2-specific parameters
|
||||
// - Optimize for 8/10 GPU cores typically found in M2
|
||||
// - Account for increased memory bandwidth on M2
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M3-optimized matrix multiplication
|
||||
fn matmulM3(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M3-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M3-specific parameters
|
||||
// - Optimize for 10/16 GPU cores typically found in M3
|
||||
// - Account for dynamic core switching on M3
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
@ -77,16 +193,59 @@ pub const MetalBackend = struct {
|
||||
return error.MetalNotAvailable;
|
||||
}
|
||||
|
||||
_ = input;
|
||||
std.log.debug("Metal RMS normalization with {} elements", .{input.len});
|
||||
|
||||
// Check if we're on Apple Silicon M series for optimized path
|
||||
if (self.device_info != null and self.device_info.?.is_m_series) {
|
||||
std.log.debug("Using optimized M{d} {s} RMS normalization",
|
||||
.{
|
||||
self.device_info.?.series_generation,
|
||||
self.device_info.?.variant
|
||||
}
|
||||
);
|
||||
|
||||
// Select optimal workgroup size based on M series generation
|
||||
const workgroup_size = switch (self.device_info.?.series_generation) {
|
||||
3 => 256, // M3 has more GPU cores
|
||||
2 => 192, // M2 optimization
|
||||
else => 128, // M1 and others
|
||||
};
|
||||
|
||||
// Determine if we should use unified memory approach
|
||||
const use_unified_memory = self.memory_strategy == .UnifiedMemory;
|
||||
|
||||
// Calculate optimal thread count based on input size and GPU cores
|
||||
const thread_count = @min(
std.mem.alignForward(usize, input.len, workgroup_size),
|
||||
workgroup_size * 1024 // Maximum reasonable thread count
|
||||
);
|
||||
|
||||
std.log.debug("RMS Norm using workgroup size: {}, threads: {}",
|
||||
.{workgroup_size, thread_count});
|
||||
|
||||
// TODO: Implement Metal compute shader for RMS norm with M-series optimizations
|
||||
// 1. Create buffers (potentially using managed storage mode for unified memory)
|
||||
// 2. Set up compute pipeline with RMS norm shader
|
||||
// 3. Dispatch compute with optimal work group size
|
||||
// 4. Handle results with zero-copy when possible on unified memory
|
||||
|
||||
if (!use_unified_memory) {
|
||||
// Would handle non-unified memory path differently
|
||||
std.log.debug("Using discrete memory path");
|
||||
}
|
||||
|
||||
// thread_count is used in the log message above, don't discard it
|
||||
}
|
||||
|
||||
// TODO: Complete implementation of Metal compute shader for RMS norm
|
||||
// Metal excels at parallel operations like normalization
|
||||
|
||||
// Don't discard input since it's used above for thread_count calculation
|
||||
// Only discard these if not used above
|
||||
_ = weight;
|
||||
_ = output;
|
||||
_ = eps;
|
||||
|
||||
std.log.debug("Metal RMS normalization");
|
||||
|
||||
// TODO: Implement Metal compute shader for RMS norm
|
||||
// Metal excels at parallel operations like normalization
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
|
experimental/src/backends/metal/shader.zig (new file)
@ -0,0 +1,254 @@
|
||||
// Metal shader utility for managing and optimizing Metal shaders
|
||||
// With specific optimizations for M-series Apple Silicon
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const device = @import("device.zig");
|
||||
const MetalDeviceInfo = device.MetalDeviceInfo;
|
||||
|
||||
/// Optimization level for Metal shaders
|
||||
pub const ShaderOptimizationLevel = enum {
|
||||
none,
|
||||
default,
|
||||
performance,
|
||||
size,
|
||||
|
||||
/// Get the recommended optimization level based on device capabilities
|
||||
pub fn fromDeviceInfo(device_info: ?MetalDeviceInfo) ShaderOptimizationLevel {
|
||||
if (device_info == null) return .default;
|
||||
|
||||
if (device_info.?.is_m_series) {
|
||||
// M3 can handle highly optimized shaders
|
||||
if (device_info.?.series_generation >= 3) {
|
||||
return .performance;
|
||||
}
|
||||
// M1/M2 balance between performance and size
|
||||
else {
|
||||
return .default;
|
||||
}
|
||||
}
|
||||
|
||||
// For non-Apple Silicon, be more conservative
|
||||
return .default;
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader types
|
||||
pub const ShaderType = enum {
|
||||
compute,
|
||||
vertex,
|
||||
fragment,
|
||||
|
||||
pub fn toMTLFunctionType(self: ShaderType) []const u8 {
|
||||
return switch (self) {
|
||||
.compute => "MTLFunctionTypeKernel",
|
||||
.vertex => "MTLFunctionTypeVertex",
|
||||
.fragment => "MTLFunctionTypeFragment",
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader source with metadata
|
||||
pub const ShaderSource = struct {
|
||||
name: []const u8,
|
||||
source_code: []const u8,
|
||||
shader_type: ShaderType,
|
||||
|
||||
/// Create a shader source with a given name and code
|
||||
pub fn init(name: []const u8, source_code: []const u8, shader_type: ShaderType) ShaderSource {
|
||||
return .{
|
||||
.name = name,
|
||||
.source_code = source_code,
|
||||
.shader_type = shader_type,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader compilation options including M-series specific optimizations
|
||||
pub const ShaderCompileOptions = struct {
|
||||
optimization_level: ShaderOptimizationLevel,
|
||||
fast_math: bool,
|
||||
preserve_invariance: bool,
|
||||
|
||||
/// Create default options for a specific device
|
||||
pub fn forDevice(device_info: ?MetalDeviceInfo) ShaderCompileOptions {
|
||||
const opt_level = ShaderOptimizationLevel.fromDeviceInfo(device_info);
|
||||
|
||||
// M-series chips benefit from fast math but some algorithms require precision
|
||||
const fast_math = device_info != null and
|
||||
device_info.?.is_m_series and
|
||||
device_info.?.series_generation >= 2;
|
||||
|
||||
return .{
|
||||
.optimization_level = opt_level,
|
||||
.fast_math = fast_math,
|
||||
.preserve_invariance = false,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Utility for managing Metal shader compilation and caching
|
||||
pub const ShaderManager = struct {
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
compile_options: ShaderCompileOptions,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
/// Create a new shader manager
|
||||
pub fn init(
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo
|
||||
) Self {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_info = device_info,
|
||||
.compile_options = ShaderCompileOptions.forDevice(device_info),
|
||||
};
|
||||
}
|
||||
|
||||
/// Clean up resources
|
||||
pub fn deinit(self: *Self) void {
|
||||
_ = self;
|
||||
}
|
||||
|
||||
/// Get optimal threadgroup size for a compute shader on current device
|
||||
pub fn getOptimalThreadgroupSize(self: *Self) struct { x: u32, y: u32, z: u32 } {
|
||||
if (self.device_info == null or !self.device_info.?.is_apple_silicon) {
|
||||
return .{ .x = 8, .y = 8, .z = 1 };
|
||||
}
|
||||
|
||||
// M-series chips have different optimal sizes
|
||||
if (self.device_info.?.is_m_series) {
|
||||
return switch (self.device_info.?.series_generation) {
|
||||
3 => .{ .x = 16, .y = 16, .z = 1 }, // M3 has more GPU cores
|
||||
2 => .{ .x = 16, .y = 8, .z = 1 }, // M2
|
||||
else => .{ .x = 8, .y = 8, .z = 1 }, // M1
|
||||
};
|
||||
}
|
||||
|
||||
return .{ .x = 8, .y = 8, .z = 1 };
|
||||
}
|
||||
|
||||
/// Get memory barrier type based on hardware capabilities
|
||||
pub fn getOptimalBarrierType(self: *Self) []const u8 {
|
||||
// Newer M-series chips support more efficient memory barriers
|
||||
if (self.device_info != null and
|
||||
self.device_info.?.is_m_series and
|
||||
self.device_info.?.series_generation >= 2) {
|
||||
return "MTLBarrierScopeBuffers";
|
||||
}
|
||||
|
||||
return "MTLBarrierScopeTextures | MTLBarrierScopeBuffers";
|
||||
}
|
||||
|
||||
/// Generate compilation options string for Metal API
|
||||
pub fn getCompileOptionsString(self: *Self) []const u8 {
|
||||
_ = self;
|
||||
// In a real implementation, this would return Objective-C code to set up
|
||||
// MTLCompileOptions with the appropriate parameters
|
||||
return "MTLCompileOptions"; // Placeholder
|
||||
}
|
||||
};
|
||||
|
||||
/// Create optimized Metal shaders for key operations based on device capabilities
|
||||
pub fn createOptimizedMetalShaders(device_info: ?MetalDeviceInfo) struct {
|
||||
matmul: []const u8,
|
||||
rms_norm: []const u8,
|
||||
swiglu: []const u8,
|
||||
attention: []const u8,
|
||||
} {
|
||||
// Base versions of shaders
|
||||
const base_matmul_shader =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void matmul_kernel(
|
||||
\\ device const float* a [[buffer(0)]],
|
||||
\\ device const float* b [[buffer(1)]],
|
||||
\\ device float* c [[buffer(2)]],
|
||||
\\ constant uint& M [[buffer(3)]],
|
||||
\\ constant uint& N [[buffer(4)]],
|
||||
\\ constant uint& K [[buffer(5)]],
|
||||
\\ uint2 gid [[thread_position_in_grid]]
|
||||
\\) {
|
||||
\\ if (gid.x >= N || gid.y >= M) return;
|
||||
\\
|
||||
\\ float sum = 0.0;
|
||||
\\ for (uint k = 0; k < K; k++) {
|
||||
\\ sum += a[gid.y * K + k] * b[k * N + gid.x];
|
||||
\\ }
|
||||
\\ c[gid.y * N + gid.x] = sum;
|
||||
\\}
|
||||
;
|
||||
|
||||
const base_rms_norm_shader =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void rms_norm_kernel(
|
||||
\\ device const float* input [[buffer(0)]],
|
||||
\\ device const float* weight [[buffer(1)]],
|
||||
\\ device float* output [[buffer(2)]],
|
||||
\\ constant uint& size [[buffer(3)]],
|
||||
\\ constant float& eps [[buffer(4)]],
|
||||
\\ uint idx [[thread_position_in_grid]]
|
||||
\\) {
|
||||
\\ if (idx >= size) return;
|
||||
\\
|
||||
\\ // Calculate sum of squares
|
||||
\\ float sum_sq = 0.0;
|
||||
\\ for (uint i = 0; i < size; i++) {
|
||||
\\ float val = input[i];
|
||||
\\ sum_sq += val * val;
|
||||
\\ }
|
||||
\\
|
||||
\\ // RMS normalization
|
||||
\\ float rms = sqrt(sum_sq / size + eps);
|
||||
\\ output[idx] = input[idx] / rms * weight[idx];
|
||||
\\}
|
||||
;
|
||||
|
||||
// Default implementations
|
||||
var matmul = base_matmul_shader;
|
||||
var rms_norm = base_rms_norm_shader;
|
||||
var swiglu = ""; // Placeholder
|
||||
var attention = ""; // Placeholder
|
||||
|
||||
// For M-series chips, we can use optimized implementations
|
||||
if (device_info != null and device_info.?.is_m_series) {
|
||||
// M3 optimizations
|
||||
if (device_info.?.series_generation >= 3) {
|
||||
// M3 has improved threadgroup memory, use tiled implementation
|
||||
matmul =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void matmul_kernel_optimized_m3(
|
||||
\\ device const float* a [[buffer(0)]],
|
||||
\\ device const float* b [[buffer(1)]],
|
||||
\\ device float* c [[buffer(2)]],
|
||||
\\ constant uint& M [[buffer(3)]],
|
||||
\\ constant uint& N [[buffer(4)]],
|
||||
\\ constant uint& K [[buffer(5)]],
|
||||
\\ uint2 gid [[thread_position_in_grid]],
|
||||
\\ uint2 tid [[thread_position_in_threadgroup]],
|
||||
\\ uint2 tgid [[threadgroup_position_in_grid]]
|
||||
\\) {
|
||||
\\ // Advanced implementation with tiling and local memory
|
||||
\\ // Optimized for M3 architecture
|
||||
\\ // ...
|
||||
\\}
|
||||
;
|
||||
|
||||
// Similar optimizations for other kernels...
|
||||
}
|
||||
}
|
||||
|
||||
return .{
|
||||
.matmul = matmul,
|
||||
.rms_norm = rms_norm,
|
||||
.swiglu = swiglu,
|
||||
.attention = attention,
|
||||
};
|
||||
}
|
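A hypothetical driver for the shader utilities above (not part of this commit; the `@import` path assumes a file alongside `shader.zig`):

```zig
const std = @import("std");
const shader = @import("shader.zig");

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // Without device info the manager falls back to conservative defaults.
    var manager = shader.ShaderManager.init(gpa.allocator(), null);
    defer manager.deinit();

    const tg = manager.getOptimalThreadgroupSize(); // 8x8x1 when no device info is available
    std.debug.print("threadgroup: {d}x{d}x{d}\n", .{ tg.x, tg.y, tg.z });

    const shaders = shader.createOptimizedMetalShaders(null);
    std.debug.print("base matmul kernel source: {d} bytes\n", .{shaders.matmul.len});
}
```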
experimental/src/test_m_series.zig (new file)
@ -0,0 +1,39 @@
|
||||
// Test program for M series detection
|
||||
const std = @import("std");
|
||||
const metal_device = @import("backends/metal/device.zig");
|
||||
|
||||
pub fn main() !void {
|
||||
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
||||
defer _ = gpa.deinit();
|
||||
|
||||
const allocator = gpa.allocator();
|
||||
|
||||
std.log.info("Testing M series detection...", .{});
|
||||
|
||||
// Detect Apple Silicon and M-series capabilities
|
||||
const device_info = try metal_device.detectAppleSilicon(allocator);
|
||||
defer {
|
||||
allocator.free(device_info.device_name);
|
||||
allocator.free(device_info.variant);
|
||||
}
|
||||
|
||||
std.log.info("Device Info:", .{});
|
||||
std.log.info(" Device Name: {s}", .{device_info.device_name});
|
||||
std.log.info(" Is Apple Silicon: {}", .{device_info.is_apple_silicon});
|
||||
std.log.info(" Is M Series: {}", .{device_info.is_m_series});
|
||||
|
||||
if (device_info.is_m_series) {
|
||||
std.log.info(" M Series Generation: {}", .{device_info.series_generation});
|
||||
std.log.info(" Variant: {s}", .{device_info.variant});
|
||||
}
|
||||
|
||||
std.log.info(" Unified Memory: {} GB", .{device_info.unified_memory_size / (1024 * 1024 * 1024)});
|
||||
std.log.info(" Has Apple Neural Engine: {}", .{device_info.has_anc});
|
||||
|
||||
// Test other utility functions
|
||||
std.log.info("Optimal Work Group Size: {}", .{metal_device.getOptimalWorkGroupSize()});
|
||||
std.log.info("Memory Strategy: {s}", .{@tagName(metal_device.getMemoryStrategy())});
|
||||
std.log.info("Optimal Tensor Block Size: {}", .{metal_device.getOptimalTensorBlockSize()});
|
||||
|
||||
std.log.info("Test complete!", .{});
|
||||
}
|
Binary file not shown.