Merge pull request #2 from Triex:feat-Enhanced-device-detection-handling-initial-metal

Feat-Enhanced-device-detection-handling-initial-metal
Alex Zarov 2025-06-11 17:50:54 +10:00 committed by GitHub
commit 24d94f7c21
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
9 changed files with 1001 additions and 85 deletions

3
.gitignore vendored
View File

@ -172,4 +172,5 @@ cython_debug/
.DS_Store
# Zig
experimental/.zig-cache/
zig-out/

136
README.md
View File

@ -16,18 +16,22 @@
</div>
<hr />
# DeepZig V3: A High-Performance LLM Architecture
## Overview
A **DRAFT proposal & foundation** for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
**⚠️ Status: EXPERIMENTAL DRAFT** ✅ **Foundation compiles with Zig 0.15.0-dev**, including:
- ✅ HTTP server framework (basic structure)
- ✅ SIMD-optimized tensor operations (draft implementation)
- ✅ Cross-platform backend architecture
- ✅ Initial memory management
- ✅ **Apple Silicon M-series detection** (hardware detection via sysctl)
- ✅ Comprehensive build system draft
- ⚠️ **NOT PRODUCTION READY** - Draft implementation for research/development
**Performance Note**: Current naive algorithms are ~1000x slower than optimized BLAS. Matrix multiplication: ~6418ms for 1024×1024 (see the table below). This is expected for a foundational draft implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
## Why This Matters
@ -37,6 +41,18 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
- **Complex deployment** with heavy runtimes
- **Platform lock-in** due to dependency complexity
## Expected Benefits vs Current Reality
| Aspect | Current (PyTorch) | Target (Zig) | **Current Draft** |
|--------|------------------|--------------|-------------------|
| Cold start | 10-30s | **< 2s** | *Not measured* |
| Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
| Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
| Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | *6418ms (naive)* |
*See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
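For context on that gap, the draft multiplies matrices with a plain triple loop along the lines of the sketch below (an illustrative sketch, not the exact experimental code; the flat-slice signature and the name `matmulNaive` are placeholders). Optimized BLAS kernels tile these loops to keep operands in cache and use SIMD FMA units, which is what the planned BLAS integration targets.

```zig
const std = @import("std");

/// Naive O(M*N*K) matrix multiply over row-major f32 slices (illustrative sketch).
fn matmulNaive(a: []const f32, b: []const f32, out: []f32, m: usize, n: usize, k: usize) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var sum: f32 = 0.0;
            for (0..k) |p| {
                // The strided walk over `b` misses cache on every step, which is the main
                // reason this is ~1000x slower than a tiled BLAS kernel.
                sum += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = sum;
        }
    }
}
```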
## Why Zig?
**Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access<br/>
@ -56,14 +72,14 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
└─────────────────┘ └──────────────────┘ └─────────────────┘
```
## Draft Web API Framework
### Planned Endpoints (Basic Structure Implemented)
- `POST /v1/chat/completions` - OpenAI-compatible chat API
- `POST /v1/completions` - Text completion
- `GET /v1/models` - List available models
- `GET /health` - Service health check
- `WebSocket /ws` - Streaming inference (planned)
### Deployment Vision
- **Static binaries** - Single file deployment, no dependencies
@ -72,56 +88,55 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
- **Serverless functions** - Minimal cold start with static linking
- **WebAssembly** - Browser inference without additional runtime
## Implementation Plan Status
### Phase 1: Foundation ✅ **DRAFT COMPLETE**
- [x] Set up Zig project structure
- [x] Implement basic tensor operations with SIMD
- [x] Create memory management system (arena allocators)
- [x] Build HTTP server framework
- [x] **Apple Silicon detection via sysctl calls**
- [x] **Updated to Zig 0.15.0-dev - compiles cleanly**
- [x] **Benchmark suite** showing current performance
*📈 Performance baseline established - see [benchmarks](experimental/README.md#benchmarks)*
### Phase 2: Core Model (IN PROGRESS)
- [ ] Implement transformer layers
- [ ] Add Multi-Head Latent Attention (MLA)
- [ ] Build Mixture of Experts (MoE) routing
- [ ] Create tokenizer integration
### Phase 3: Backends (PLANNED)
- [ ] Optimize CPU backend with AVX/NEON
- [ ] Integrate Metal for Apple Silicon
- [ ] Add CUDA support for NVIDIA GPUs
- [ ] Implement WebGPU for browsers
### Phase 4: Web Integration (DRAFT STRUCTURE)
- [x] Complete HTTP API implementation (basic structure)
- [ ] Add WebSocket streaming
- [ ] Build authentication/rate limiting
- [ ] Create deployment tooling
## Technical Challenges
- **Model Complexity**: DeepSeek V3's MoE architecture requires careful memory management
- **Backend Integration**: Need efficient FFI to CUDA/Metal while maintaining performance
- **Web Scale**: Handle concurrent requests without blocking inference
- **Accuracy**: Match PyTorch numerical precision
- **Performance**: Current implementation is ~1000x slower than optimized BLAS - major optimization needed
## Platform-Specific Opportunities
### Apple Silicon (M-Series) ✅ **Draft Detection Implemented**
- **Metal Performance Shaders** integration for matrix operations
- **AMX instruction set** access for accelerated linear algebra
- **Unified memory architecture** exploitation for zero-copy transfers
- **Power efficiency tuning** across P and E cores
- **✅ Proper M1/M2/M3/M4 detection** via system calls
*Current status: Hardware detection working, GPU acceleration not yet implemented.*
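A minimal sketch of how the new detection module can be queried (it mirrors `src/test_m_series.zig` from this commit and assumes the same experimental source layout):

```zig
const std = @import("std");
const metal_device = @import("backends/metal/device.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Queries sysctl (machdep.cpu.brand_string, hw.memsize, ...) on macOS.
    const info = try metal_device.detectAppleSilicon(allocator);
    defer {
        allocator.free(info.device_name);
        allocator.free(info.variant);
    }

    std.log.info("Apple Silicon: {}, M-series: {}, generation: {}", .{
        info.is_apple_silicon, info.is_m_series, info.series_generation,
    });
    std.log.info("Optimal work group size: {}", .{metal_device.getOptimalWorkGroupSize()});
}
```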
### x86_64 Architecture
- **AVX-512 vectorization** with masked operations
@ -137,39 +152,29 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
## Getting Started
**Current Status**: This repository contains a **DRAFT EXPERIMENTAL** Zig implementation foundation.
### For the Current Zig Implementation:
```bash
# Clone this repository
git clone https://github.com/Triex/DeepZig-V3
cd DeepZig-V3/experimental

# Build and test the foundation
zig build

# Run the HTTP server (basic structure)
zig build run -- --port 8080

# Run benchmarks (see actual performance)
zig build bench

# Test Apple Silicon detection
zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
./test_m_series
```
**📊 Performance Reality Check**: See [experimental/README.md](experimental/README.md) for actual benchmark results showing current performance limitations and optimization opportunities.
## Development Approach
@ -183,38 +188,27 @@ Reference: [Zig Cookbook](https://zigcc.github.io/zig-cookbook/) for implementat
## Seeking Contributors
This is an ambitious **DRAFT project** that would benefit from expertise in:
- **Performance optimization** (current bottleneck: naive matrix operations)
- **Zig systems programming**
- **GPU kernel optimization** (CUDA/Metal)
- **ML model implementation**
- **Web server development**
- **Hardware-software co-design**
- **Novel inference techniques** (Speculative decoding, quantization)
## Current Limitations & Next Steps
**🚧 What's Working**: Compiles, runs, measures performance
**⚠️ What's Missing**: Optimized algorithms, robust flows, actual DeepSeek V3 model
**📊 Performance Gap**: ~1000x slower than production systems
**🎯 Next Priority**: BLAS integration and GPU acceleration

See [experimental implementation](experimental/) for technical details and current benchmarks.
## References
- [DeepZig V3 (Experimental Implementation)](experimental/) - **Current working code**
- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437) - Original model architecture
- [Zig Language](https://ziglang.org/) - Language documentation
- [Awesome Zig](https://github.com/C-BJ/awesome-zig) - Community resources
@ -225,5 +219,7 @@ A: Those are inference runtimes, not development frameworks / LLM frameworks. Th
---
**Status**: 🎯 **EXPERIMENTAL DRAFT** - Foundation compiles and runs basic operations ([see benchmarks](experimental/README.md#benchmarks))<br/>
**Vision**: Foundation for advanced AI reasoning research
**⚠️ Important**: This is a **research/development foundation** with draft/base implementations. Not ready for production use.

View File

@ -9,6 +9,7 @@ A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/)
> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON)
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
> - ✅ **Memory management** and backend architecture
> - ✅ **Apple Silicon detection via sysctl calls**
>
> **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.<br/>
> **Performance Note**: Current implementation uses naive algorithms - matrix multiplication is ~1000x slower than optimized BLAS. See [benchmarks](#benchmarks) below.<br/>
@ -25,6 +26,8 @@ This experimental implementation aims to leverage Zig's unique advantages for sy
- **Single binary deployment** with no runtime dependencies
- **Cross-platform compilation** for multiple architectures
**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.
## Project Structure
```
@ -230,20 +233,26 @@ Run benchmarks to measure performance:
zig build bench
```
**Hardware Context**: Benchmarks run on Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.
Example output:
```
🚀 DeepZig V3 Performance Benchmarks
==========================================
Backend: CPU (SIMD optimized)
Architecture: aarch64
Thread count: 8
Hardware: Apple M1 MacBook Pro, 16GB unified memory
Operation | Iterations | Avg Time | Operations/s | Memory
-------------------------------|------------|-----------|--------------|-------
Tensor Creation (1024x1024) | 1000 iter | 2.03 ms | 493 ops/s | 4.0 MB
Tensor Addition (SIMD) | 100 iter | 1.49 ms | 2806962690 ops/s | 48.0 MB
Matrix Multiplication | 10 iter | 6418.08 ms | 0 GFLOPS | 12.0 MB
SwiGLU Activation | 1000 iter | 4.44 ms | 236002478 ops/s | 12.0 MB
RMS Normalization (SIMD) | 1000 iter | 0.00 ms | 1077586 ops/s | 0.0 MB
Memory Bandwidth | 100 iter | 4.92 ms | 13 ops/s | 128.0 MB
```
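For reference, the per-operation averages above can be reproduced with a timing loop of roughly this shape (an illustrative sketch, not the actual benchmark source; `benchMs` is a hypothetical helper):

```zig
const std = @import("std");

/// Sketch: time `iters` runs of a void operation and return the average in milliseconds.
fn benchMs(iters: usize, comptime op: fn () void) !f64 {
    var timer = try std.time.Timer.start();
    var i: usize = 0;
    while (i < iters) : (i += 1) {
        op();
    }
    const total_ns: f64 = @floatFromInt(timer.read());
    return total_ns / @as(f64, @floatFromInt(iters)) / std.time.ns_per_ms;
}
```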
## Known Issues

View File

@ -0,0 +1,306 @@
// Metal Device detection and handling for Apple Silicon
// Specifically optimized for M-series chips using proper system detection
const std = @import("std");
const Allocator = std.mem.Allocator;
const c = std.c;
// Device information structure
pub const MetalDeviceInfo = struct {
device_name: []const u8,
is_apple_silicon: bool,
is_m_series: bool,
series_generation: u8, // 1 = M1, 2 = M2, 3 = M3, etc.
variant: []const u8, // "Pro", "Max", "Ultra", etc.
unified_memory_size: u64,
has_anc: bool, // Apple Neural Engine
pub fn format(
self: @This(),
comptime fmt: []const u8,
options: std.fmt.FormatOptions,
writer: anytype,
) !void {
_ = fmt;
_ = options;
try writer.print("Metal Device: {s} ({s}{d} {s})", .{
self.device_name,
if (self.is_m_series) "M" else "",
if (self.is_m_series) self.series_generation else 0,
if (self.is_m_series) self.variant else "",
});
try writer.print("\nUnified Memory: {} GB", .{self.unified_memory_size / (1024 * 1024 * 1024)});
try writer.print("\nApple Neural Engine: {}", .{if (self.has_anc) "Available" else "Not Available"});
}
};
// M-series chip information
const MSeriesInfo = struct {
is_m_series: bool,
generation: u8,
variant: []const u8,
};
// System detection using sysctl
const SysctlError = error{
NotFound,
BufferTooSmall,
SystemError,
};
/// Get sysctl string value
fn getSysctlString(allocator: Allocator, name: []const u8) ![]const u8 {
// Only available on macOS
if (@import("builtin").os.tag != .macos) {
return SysctlError.NotFound;
}
var size: usize = 0;
// First, get the size needed
const name_cstr = try allocator.dupeZ(u8, name);
defer allocator.free(name_cstr);
if (c.sysctlbyname(name_cstr.ptr, null, &size, null, 0) != 0) {
return SysctlError.NotFound;
}
// Allocate buffer and get the actual value
const buffer = try allocator.alloc(u8, size);
defer allocator.free(buffer);
if (c.sysctlbyname(name_cstr.ptr, buffer.ptr, &size, null, 0) != 0) {
return SysctlError.SystemError;
}
// Return a copy of the string (minus null terminator if present)
const len = if (size > 0 and buffer[size - 1] == 0) size - 1 else size;
return try allocator.dupe(u8, buffer[0..len]);
}
/// Get sysctl integer value
fn getSysctlInt(comptime T: type, name: []const u8, allocator: Allocator) !T {
if (@import("builtin").os.tag != .macos) {
return SysctlError.NotFound;
}
var value: T = 0;
var size: usize = @sizeOf(T);
const name_cstr = try allocator.dupeZ(u8, name);
defer allocator.free(name_cstr);
if (c.sysctlbyname(name_cstr.ptr, &value, &size, null, 0) != 0) {
return SysctlError.NotFound;
}
return value;
}
/// Check if running under Rosetta 2 translation
fn isRunningUnderRosetta(allocator: Allocator) bool {
const result = getSysctlInt(i32, "sysctl.proc_translated", allocator) catch return false;
return result == 1;
}
/// Check if hardware supports ARM64 (Apple Silicon)
fn isAppleSiliconHardware(allocator: Allocator) bool {
// Check for ARM64 support
const arm64_support = getSysctlInt(i32, "hw.optional.arm64", allocator) catch return false;
if (arm64_support == 1) return true;
// Alternative check: CPU architecture
if (@import("builtin").target.cpu.arch == .aarch64) return true;
// If running under Rosetta, we're on Apple Silicon
return isRunningUnderRosetta(allocator);
}
/// Parse M-series information from CPU brand string
fn parseMSeriesInfo(cpu_brand: []const u8) MSeriesInfo {
// Default values
var result = MSeriesInfo{
.is_m_series = false,
.generation = 0,
.variant = "",
};
// Look for Apple M pattern
if (std.mem.indexOf(u8, cpu_brand, "Apple M") == null) {
return result;
}
result.is_m_series = true;
// Extract generation and variant from CPU brand string
// Examples: "Apple M1", "Apple M1 Pro", "Apple M1 Max", "Apple M1 Ultra"
if (std.mem.indexOf(u8, cpu_brand, "M1")) |_| {
result.generation = 1;
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
result.variant = "Pro";
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
result.variant = "Max";
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
result.variant = "Ultra";
} else {
// Just "Apple M1" - this is the regular M1
result.variant = "";
}
} else if (std.mem.indexOf(u8, cpu_brand, "M2")) |_| {
result.generation = 2;
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
result.variant = "Pro";
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
result.variant = "Max";
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
result.variant = "Ultra";
} else {
result.variant = "";
}
} else if (std.mem.indexOf(u8, cpu_brand, "M3")) |_| {
result.generation = 3;
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
result.variant = "Pro";
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
result.variant = "Max";
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
result.variant = "Ultra";
} else {
result.variant = "";
}
} else if (std.mem.indexOf(u8, cpu_brand, "M4")) |_| {
result.generation = 4;
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
result.variant = "Pro";
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
result.variant = "Max";
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
result.variant = "Ultra";
} else {
result.variant = "";
}
}
return result;
}
/// Try to detect GPU configuration for more detailed chip identification
fn detectGPUCores(allocator: Allocator) u32 {
// Try to get GPU core count - this can help distinguish variants
// Regular M1: 7-8 GPU cores
// M1 Pro: 14-16 GPU cores
// M1 Max: 24-32 GPU cores
// This is a placeholder - actual implementation would query Metal API
// For now, return 0 to indicate unknown
_ = allocator;
return 0;
}
/// Detect Apple Silicon and M-series chip capabilities using proper system detection
pub fn detectAppleSilicon(allocator: Allocator) !MetalDeviceInfo {
// Check at compile-time if we're on macOS
const is_macos = @import("builtin").os.tag == .macos;
if (!is_macos) {
return MetalDeviceInfo{
.device_name = try allocator.dupe(u8, "Non-macOS Device"),
.is_apple_silicon = false,
.is_m_series = false,
.series_generation = 0,
.variant = try allocator.dupe(u8, ""),
.unified_memory_size = 0,
.has_anc = false,
};
}
// Detect Apple Silicon hardware
const is_apple_silicon = isAppleSiliconHardware(allocator);
if (!is_apple_silicon) {
return MetalDeviceInfo{
.device_name = try allocator.dupe(u8, "Intel Mac"),
.is_apple_silicon = false,
.is_m_series = false,
.series_generation = 0,
.variant = try allocator.dupe(u8, ""),
.unified_memory_size = 0,
.has_anc = false,
};
}
// Get CPU brand string for M-series detection - this is the authoritative source
const cpu_brand = getSysctlString(allocator, "machdep.cpu.brand_string") catch try allocator.dupe(u8, "Apple Silicon");
defer allocator.free(cpu_brand);
std.log.debug("CPU Brand String: '{s}'", .{cpu_brand});
// Parse M-series information from the actual CPU brand string
const m_info = parseMSeriesInfo(cpu_brand);
// Get additional hardware details for logging/debugging
const hw_model = getSysctlString(allocator, "hw.model") catch "";
defer if (hw_model.len > 0) allocator.free(hw_model);
const gpu_cores = detectGPUCores(allocator);
if (gpu_cores > 0) {
std.log.debug("GPU Cores: {}", .{gpu_cores});
}
std.log.debug("Hardware Model: '{s}'", .{hw_model});
std.log.debug("Detected M{d} {s}", .{ m_info.generation, m_info.variant });
// Get system memory
const memory_size = getSysctlInt(u64, "hw.memsize", allocator) catch (16 * 1024 * 1024 * 1024); // Default 16GB
// Get device name
const device_name = getSysctlString(allocator, "hw.model") catch try allocator.dupe(u8, "Apple Silicon Mac");
return MetalDeviceInfo{
.device_name = device_name, // This will be owned by the caller
.is_apple_silicon = true,
.is_m_series = m_info.is_m_series,
.series_generation = m_info.generation,
.variant = try allocator.dupe(u8, m_info.variant), // Duplicate to ensure consistent allocation
.unified_memory_size = memory_size,
.has_anc = m_info.is_m_series, // All M-series have Apple Neural Engine
};
}
/// Get optimal GPU parameters for detected device
pub fn getOptimalWorkGroupSize() u32 {
// These are reasonable defaults that should work well on most Apple GPU architectures
// In a real implementation, we would query Metal API for the actual optimal values
if (@import("builtin").target.cpu.arch == .aarch64) {
// Apple Silicon optimized values based on GPU core count
return 128;
}
// Default for Intel Macs and others
return 64;
}
/// Memory allocation strategies reported to backends
pub const MemoryStrategy = enum { UnifiedMemory, DiscreteMemory };
/// Get recommended memory allocation strategy based on device capabilities
pub fn getMemoryStrategy() MemoryStrategy {
// Check if we're on Apple Silicon hardware (even under Rosetta)
if (@import("builtin").os.tag == .macos) {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
if (isAppleSiliconHardware(allocator)) {
return .UnifiedMemory; // Apple Silicon uses unified memory
}
}
// For Intel Macs and other platforms
return .DiscreteMemory;
}
/// Get optimal tensor block size for current device
pub fn getOptimalTensorBlockSize() u32 {
if (@import("builtin").target.cpu.arch == .aarch64) {
// Apple Silicon has more GPU cores and benefits from larger blocks
return 256;
} else {
return 128;
}
}

View File

@ -0,0 +1,152 @@
// Metal-specific memory management for Apple Silicon
// Optimized for the unified memory architecture of M-series chips
const std = @import("std");
const Allocator = std.mem.Allocator;
const device = @import("device.zig");
const MetalDeviceInfo = device.MetalDeviceInfo;
/// Memory modes available for Metal buffers
pub const MetalMemoryMode = enum {
/// Shared between CPU and GPU with automatic migration
Shared,
/// Managed with separate CPU and GPU views but synchronized
Managed,
/// GPU-only storage for maximum performance
Private,
/// Memory visible to both CPU and GPU (Apple Silicon only)
Unified,
};
/// Buffer usage patterns to optimize memory allocation
pub const MetalBufferUsage = enum {
/// Read often by GPU
GpuRead,
/// Write often by GPU
GpuWrite,
/// Read/write by both CPU and GPU
Shared,
/// Used only temporarily for a single operation
Transient,
};
/// Memory manager for optimal Metal buffer allocation on M-series chips
pub const MetalMemoryManager = struct {
allocator: Allocator,
device_info: ?MetalDeviceInfo,
total_allocated: usize,
max_allocation: usize,
const Self = @This();
/// Create a new Metal memory manager
pub fn init(allocator: Allocator, device_info: ?MetalDeviceInfo) Self {
return Self{
.allocator = allocator,
.device_info = device_info,
.total_allocated = 0,
.max_allocation = 0,
};
}
/// Clean up any resources
pub fn deinit(self: *Self) void {
// Release any cached buffers or other resources
_ = self;
}
/// Get the optimal memory mode based on device capabilities and usage pattern
pub fn getOptimalMemoryMode(self: *Self, usage: MetalBufferUsage) MetalMemoryMode {
// If we're on Apple Silicon, we can use unified memory
const is_apple_silicon = self.device_info != null and self.device_info.?.is_apple_silicon;
if (is_apple_silicon) {
return switch (usage) {
.GpuRead => .Unified,
.GpuWrite => .Unified,
.Shared => .Unified,
.Transient => .Private, // Even on unified memory, transient data is better in private
};
} else {
// On Intel Macs with discrete GPU
return switch (usage) {
.GpuRead => .Managed,
.GpuWrite => .Private,
.Shared => .Managed,
.Transient => .Private,
};
}
}
/// Get recommended allocation size (aligned to device preferences)
pub fn getOptimalAllocationSize(self: *Self, requested_size: usize) usize {
// M-series chips prefer certain memory alignment patterns
const alignment: usize = if (self.device_info != null and self.device_info.?.is_m_series)
16 * 1024 // 16KB alignment on M-series
else
4 * 1024; // 4KB on other devices
return std.mem.alignForward(usize, requested_size, alignment);
}
/// Track memory allocations for monitoring
pub fn trackAllocation(self: *Self, size: usize) void {
self.total_allocated += size;
self.max_allocation = @max(self.max_allocation, self.total_allocated);
}
/// Track memory deallocations
pub fn trackDeallocation(self: *Self, size: usize) void {
if (self.total_allocated >= size) {
self.total_allocated -= size;
} else {
self.total_allocated = 0;
}
}
/// Get memory usage statistics
pub fn getMemoryStats(self: *Self) struct {
current: usize,
peak: usize,
device_total: usize,
} {
const device_total = if (self.device_info != null)
self.device_info.?.unified_memory_size
else
0;
return .{
.current = self.total_allocated,
.peak = self.max_allocation,
.device_total = device_total,
};
}
/// Get recommended buffer storage mode string for Metal API
pub fn getStorageModeString(mode: MetalMemoryMode) []const u8 {
return switch (mode) {
.Shared => "MTLStorageModeShared",
.Managed => "MTLStorageModeManaged",
.Private => "MTLStorageModePrivate",
.Unified => "MTLStorageModeShared", // Unified uses Shared on the API level
};
}
};
/// Helper to determine if hazard tracking should be enabled based on device capabilities
pub fn shouldUseHazardTracking(device_info: ?MetalDeviceInfo) bool {
if (device_info == null) return false;
// M3 and newer have better hazard tracking hardware
if (device_info.?.is_m_series and device_info.?.series_generation >= 3) {
return true;
}
return false;
}
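A short usage sketch of the memory manager above (illustrative; it assumes the module sits next to `device.zig` and is imported as `memory.zig`, which the diff does not confirm):

```zig
const std = @import("std");
const device = @import("device.zig");
const metal_memory = @import("memory.zig"); // assumed module name

fn planBuffer(allocator: std.mem.Allocator) !void {
    const info = try device.detectAppleSilicon(allocator);
    defer {
        allocator.free(info.device_name);
        allocator.free(info.variant);
    }

    var manager = metal_memory.MetalMemoryManager.init(allocator, info);
    defer manager.deinit();

    // On M-series this resolves to unified storage and a 16 KB-aligned size.
    const mode = manager.getOptimalMemoryMode(.GpuRead);
    const size = manager.getOptimalAllocationSize(1_000_000);
    std.log.debug("storage mode: {s}, aligned size: {} bytes", .{
        metal_memory.MetalMemoryManager.getStorageModeString(mode),
        size,
    });
}
```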

View File

@ -4,12 +4,18 @@
const std = @import("std");
const deepseek_core = @import("deepseek_core");
const Allocator = std.mem.Allocator;
const metal_device = @import("device.zig");
const MetalDeviceInfo = metal_device.MetalDeviceInfo;
/// Metal backend implementation for Apple Silicon
pub const MetalBackend = struct {
allocator: Allocator,
device_available: bool,
unified_memory_size: u64,
device_info: ?MetalDeviceInfo,
optimal_work_group_size: u32,
memory_strategy: metal_device.MemoryStrategy,
tensor_block_size: u32,
const Self = @This();
@ -17,10 +23,37 @@ pub const MetalBackend = struct {
// Check if Metal is available (compile-time check for macOS)
const metal_available = @import("builtin").os.tag == .macos;
var device_info: ?MetalDeviceInfo = null;
var unified_memory_size: u64 = 0;
var optimal_work_group_size: u32 = 64; // Default
var tensor_block_size: u32 = 128; // Default
if (metal_available) {
// Detect Apple Silicon and M-series capabilities
device_info = try metal_device.detectAppleSilicon(allocator);
unified_memory_size = device_info.?.unified_memory_size;
optimal_work_group_size = metal_device.getOptimalWorkGroupSize();
tensor_block_size = metal_device.getOptimalTensorBlockSize();
std.log.info("Metal Backend initialized on {s}", .{device_info.?.device_name});
// Log detailed device information
if (device_info.?.is_apple_silicon) {
if (device_info.?.is_m_series) {
std.log.info("Detected M{d} {s} with {d}GB unified memory",
.{
device_info.?.series_generation,
device_info.?.variant,
unified_memory_size / (1024 * 1024 * 1024),
}
);
} else {
std.log.info("Detected Apple Silicon (non-M series) with {d}GB unified memory",
.{unified_memory_size / (1024 * 1024 * 1024)}
);
}
} else {
std.log.warn("Metal is available but not running on Apple Silicon");
}
} else {
std.log.warn("Metal Backend not available on this platform", .{});
}
@ -28,7 +61,11 @@ pub const MetalBackend = struct {
return Self{
.allocator = allocator,
.device_available = metal_available,
.unified_memory_size = unified_memory_size,
.device_info = device_info,
.optimal_work_group_size = optimal_work_group_size,
.memory_strategy = metal_device.getMemoryStrategy(),
.tensor_block_size = tensor_block_size,
};
}
@ -54,14 +91,93 @@ pub const MetalBackend = struct {
c.shape.dims[0], c.shape.dims[1]
});
// Check if we're on Apple Silicon M series for optimized path
if (self.device_info != null and self.device_info.?.is_m_series) {
std.log.debug("Using optimized M{d} {s} matrix multiplication",
.{
self.device_info.?.series_generation,
self.device_info.?.variant
}
);
// Select appropriate implementation based on M series generation
switch (self.device_info.?.series_generation) {
3 => return try self.matmulM3(a, b, c), // M3 optimized path
2 => return try self.matmulM2(a, b, c), // M2 optimized path
1 => return try self.matmulM1(a, b, c), // M1 optimized path
else => {} // Fall through to generic implementation
}
}
// TODO: Implement actual Metal compute shader
// This would involve:
// 1. Create MTLBuffer from tensor data
// 2. Set up compute pipeline with matmul shader
// 3. Dispatch compute commands with optimized workgroup size based on device
// 4. Copy results back to tensor
// For now, fall back to CPU implementation
std.log.warn("Falling back to CPU implementation, Metal not implemented", .{});
return error.NotImplemented;
}
/// M1-optimized matrix multiplication
fn matmulM1(
self: *Self,
a: *deepseek_core.Tensor,
b: *const deepseek_core.Tensor,
c: *deepseek_core.Tensor,
) !void {
_ = self;
_ = a;
_ = b;
_ = c;
// TODO: M1-specific optimizations
// - Use MPSMatrixMultiplication with M1-specific parameters
// - Optimize for 7/8 GPU cores typically found in M1
// - Account for unified memory bandwidth on M1
return error.NotImplemented;
}
/// M2-optimized matrix multiplication
fn matmulM2(
self: *Self,
a: *deepseek_core.Tensor,
b: *const deepseek_core.Tensor,
c: *deepseek_core.Tensor,
) !void {
_ = self;
_ = a;
_ = b;
_ = c;
// TODO: M2-specific optimizations
// - Use MPSMatrixMultiplication with M2-specific parameters
// - Optimize for 8/10 GPU cores typically found in M2
// - Account for increased memory bandwidth on M2
return error.NotImplemented;
}
/// M3-optimized matrix multiplication
fn matmulM3(
self: *Self,
a: *deepseek_core.Tensor,
b: *const deepseek_core.Tensor,
c: *deepseek_core.Tensor,
) !void {
_ = self;
_ = a;
_ = b;
_ = c;
// TODO: M3-specific optimizations
// - Use MPSMatrixMultiplication with M3-specific parameters
// - Optimize for 10/16 GPU cores typically found in M3
// - Account for dynamic core switching on M3
return error.NotImplemented;
}
@ -77,16 +193,59 @@ pub const MetalBackend = struct {
return error.MetalNotAvailable;
}
std.log.debug("Metal RMS normalization with {} elements", .{input.len});
// Check if we're on Apple Silicon M series for optimized path
if (self.device_info != null and self.device_info.?.is_m_series) {
std.log.debug("Using optimized M{d} {s} RMS normalization",
.{
self.device_info.?.series_generation,
self.device_info.?.variant
}
);
// Select optimal workgroup size based on M series generation
const workgroup_size: usize = switch (self.device_info.?.series_generation) {
3 => 256, // M3 has more GPU cores
2 => 192, // M2 optimization
else => 128, // M1 and others
};
// Determine if we should use unified memory approach
const use_unified_memory = self.memory_strategy == .UnifiedMemory;
// Calculate optimal thread count based on input size and GPU cores
const thread_count = @min(
std.mem.alignForward(usize, input.len, workgroup_size),
workgroup_size * 1024 // Maximum reasonable thread count
);
std.log.debug("RMS Norm using workgroup size: {}, threads: {}",
.{workgroup_size, thread_count});
// TODO: Implement Metal compute shader for RMS norm with M-series optimizations
// 1. Create buffers (potentially using managed storage mode for unified memory)
// 2. Set up compute pipeline with RMS norm shader
// 3. Dispatch compute with optimal work group size
// 4. Handle results with zero-copy when possible on unified memory
if (!use_unified_memory) {
// Would handle non-unified memory path differently
std.log.debug("Using discrete memory path");
}
// thread_count is used in the log message above, don't discard it
}
// TODO: Complete implementation of Metal compute shader for RMS norm
// Metal excels at parallel operations like normalization
// Don't discard input since it's used above for thread_count calculation
// Only discard these if not used above
_ = weight;
_ = output;
_ = eps;
return error.NotImplemented;
}

View File

@ -0,0 +1,254 @@
// Metal shader utility for managing and optimizing Metal shaders
// With specific optimizations for M-series Apple Silicon
const std = @import("std");
const Allocator = std.mem.Allocator;
const device = @import("device.zig");
const MetalDeviceInfo = device.MetalDeviceInfo;
/// Optimization level for Metal shaders
pub const ShaderOptimizationLevel = enum {
none,
default,
performance,
size,
/// Get the recommended optimization level based on device capabilities
pub fn fromDeviceInfo(device_info: ?MetalDeviceInfo) ShaderOptimizationLevel {
if (device_info == null) return .default;
if (device_info.?.is_m_series) {
// M3 can handle highly optimized shaders
if (device_info.?.series_generation >= 3) {
return .performance;
}
// M1/M2 balance between performance and size
else {
return .default;
}
}
// For non-Apple Silicon, be more conservative
return .default;
}
};
/// Metal shader types
pub const ShaderType = enum {
compute,
vertex,
fragment,
pub fn toMTLFunctionType(self: ShaderType) []const u8 {
return switch (self) {
.compute => "MTLFunctionTypeKernel",
.vertex => "MTLFunctionTypeVertex",
.fragment => "MTLFunctionTypeFragment",
};
}
};
/// Metal shader source with metadata
pub const ShaderSource = struct {
name: []const u8,
source_code: []const u8,
shader_type: ShaderType,
/// Create a shader source with a given name and code
pub fn init(name: []const u8, source_code: []const u8, shader_type: ShaderType) ShaderSource {
return .{
.name = name,
.source_code = source_code,
.shader_type = shader_type,
};
}
};
/// Metal shader compilation options including M-series specific optimizations
pub const ShaderCompileOptions = struct {
optimization_level: ShaderOptimizationLevel,
fast_math: bool,
preserve_invariance: bool,
/// Create default options for a specific device
pub fn forDevice(device_info: ?MetalDeviceInfo) ShaderCompileOptions {
const opt_level = ShaderOptimizationLevel.fromDeviceInfo(device_info);
// M-series chips benefit from fast math but some algorithms require precision
const fast_math = device_info != null and
device_info.?.is_m_series and
device_info.?.series_generation >= 2;
return .{
.optimization_level = opt_level,
.fast_math = fast_math,
.preserve_invariance = false,
};
}
};
/// Utility for managing Metal shader compilation and caching
pub const ShaderManager = struct {
allocator: Allocator,
device_info: ?MetalDeviceInfo,
compile_options: ShaderCompileOptions,
const Self = @This();
/// Create a new shader manager
pub fn init(
allocator: Allocator,
device_info: ?MetalDeviceInfo
) Self {
return Self{
.allocator = allocator,
.device_info = device_info,
.compile_options = ShaderCompileOptions.forDevice(device_info),
};
}
/// Clean up resources
pub fn deinit(self: *Self) void {
_ = self;
}
/// Get optimal threadgroup size for a compute shader on current device
pub fn getOptimalThreadgroupSize(self: *Self) struct { x: u32, y: u32, z: u32 } {
if (self.device_info == null or !self.device_info.?.is_apple_silicon) {
return .{ .x = 8, .y = 8, .z = 1 };
}
// M-series chips have different optimal sizes
if (self.device_info.?.is_m_series) {
return switch (self.device_info.?.series_generation) {
3 => .{ .x = 16, .y = 16, .z = 1 }, // M3 has more GPU cores
2 => .{ .x = 16, .y = 8, .z = 1 }, // M2
else => .{ .x = 8, .y = 8, .z = 1 }, // M1
};
}
return .{ .x = 8, .y = 8, .z = 1 };
}
/// Get memory barrier type based on hardware capabilities
pub fn getOptimalBarrierType(self: *Self) []const u8 {
// Newer M-series chips support more efficient memory barriers
if (self.device_info != null and
self.device_info.?.is_m_series and
self.device_info.?.series_generation >= 2) {
return "MTLBarrierScopeBuffers";
}
return "MTLBarrierScopeTextures | MTLBarrierScopeBuffers";
}
/// Generate compilation options string for Metal API
pub fn getCompileOptionsString(self: *Self) []const u8 {
_ = self;
// In a real implementation, this would return Objective-C code to set up
// MTLCompileOptions with the appropriate parameters
return "MTLCompileOptions"; // Placeholder
}
};
/// Create optimized Metal shaders for key operations based on device capabilities
pub fn createOptimizedMetalShaders(device_info: ?MetalDeviceInfo) struct {
matmul: []const u8,
rms_norm: []const u8,
swiglu: []const u8,
attention: []const u8,
} {
// Base versions of shaders
const base_matmul_shader =
\\#include <metal_stdlib>
\\using namespace metal;
\\
\\kernel void matmul_kernel(
\\ device const float* a [[buffer(0)]],
\\ device const float* b [[buffer(1)]],
\\ device float* c [[buffer(2)]],
\\ constant uint& M [[buffer(3)]],
\\ constant uint& N [[buffer(4)]],
\\ constant uint& K [[buffer(5)]],
\\ uint2 gid [[thread_position_in_grid]]
\\) {
\\ if (gid.x >= N || gid.y >= M) return;
\\
\\ float sum = 0.0;
\\ for (uint k = 0; k < K; k++) {
\\ sum += a[gid.y * K + k] * b[k * N + gid.x];
\\ }
\\ c[gid.y * N + gid.x] = sum;
\\}
;
const base_rms_norm_shader =
\\#include <metal_stdlib>
\\using namespace metal;
\\
\\kernel void rms_norm_kernel(
\\ device const float* input [[buffer(0)]],
\\ device const float* weight [[buffer(1)]],
\\ device float* output [[buffer(2)]],
\\ constant uint& size [[buffer(3)]],
\\ constant float& eps [[buffer(4)]],
\\ uint idx [[thread_position_in_grid]]
\\) {
\\ if (idx >= size) return;
\\
\\ // Calculate sum of squares
\\ float sum_sq = 0.0;
\\ for (uint i = 0; i < size; i++) {
\\ float val = input[i];
\\ sum_sq += val * val;
\\ }
\\
\\ // RMS normalization
\\ float rms = sqrt(sum_sq / size + eps);
\\ output[idx] = input[idx] / rms * weight[idx];
\\}
;
// Default implementations (typed as slices so the M3 branch below can reassign matmul)
var matmul: []const u8 = base_matmul_shader;
const rms_norm: []const u8 = base_rms_norm_shader;
const swiglu: []const u8 = ""; // Placeholder
const attention: []const u8 = ""; // Placeholder
// For M-series chips, we can use optimized implementations
if (device_info != null and device_info.?.is_m_series) {
// M3 optimizations
if (device_info.?.series_generation >= 3) {
// M3 has improved threadgroup memory, use tiled implementation
matmul =
\\#include <metal_stdlib>
\\using namespace metal;
\\
\\kernel void matmul_kernel_optimized_m3(
\\ device const float* a [[buffer(0)]],
\\ device const float* b [[buffer(1)]],
\\ device float* c [[buffer(2)]],
\\ constant uint& M [[buffer(3)]],
\\ constant uint& N [[buffer(4)]],
\\ constant uint& K [[buffer(5)]],
\\ uint2 gid [[thread_position_in_grid]],
\\ uint2 tid [[thread_position_in_threadgroup]],
\\ uint2 tgid [[threadgroup_position_in_grid]]
\\) {
\\ // Advanced implementation with tiling and local memory
\\ // Optimized for M3 architecture
\\ // ...
\\}
;
// Similar optimizations for other kernels...
}
}
return .{
.matmul = matmul,
.rms_norm = rms_norm,
.swiglu = swiglu,
.attention = attention,
};
}
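Similarly, a brief sketch of how the shader helpers above might be driven (illustrative; the `shaders.zig` module name is assumed, while `device.zig` is the detection module from this commit):

```zig
const std = @import("std");
const device = @import("device.zig");
const shaders = @import("shaders.zig"); // assumed module name

fn prepareKernels(allocator: std.mem.Allocator) !void {
    const info = try device.detectAppleSilicon(allocator);
    defer {
        allocator.free(info.device_name);
        allocator.free(info.variant);
    }

    var manager = shaders.ShaderManager.init(allocator, info);
    defer manager.deinit();

    // Generation-specific kernel sources and threadgroup sizing.
    const kernels = shaders.createOptimizedMetalShaders(info);
    const tg = manager.getOptimalThreadgroupSize();
    std.log.debug("matmul kernel: {} bytes, threadgroup: {}x{}x{}", .{
        kernels.matmul.len, tg.x, tg.y, tg.z,
    });
}
```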

View File

@ -0,0 +1,39 @@
// Test program for M series detection
const std = @import("std");
const metal_device = @import("backends/metal/device.zig");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
std.log.info("Testing M series detection...", .{});
// Detect Apple Silicon and M-series capabilities
const device_info = try metal_device.detectAppleSilicon(allocator);
defer {
allocator.free(device_info.device_name);
allocator.free(device_info.variant);
}
std.log.info("Device Info:", .{});
std.log.info(" Device Name: {s}", .{device_info.device_name});
std.log.info(" Is Apple Silicon: {}", .{device_info.is_apple_silicon});
std.log.info(" Is M Series: {}", .{device_info.is_m_series});
if (device_info.is_m_series) {
std.log.info(" M Series Generation: {}", .{device_info.series_generation});
std.log.info(" Variant: {s}", .{device_info.variant});
}
std.log.info(" Unified Memory: {} GB", .{device_info.unified_memory_size / (1024 * 1024 * 1024)});
std.log.info(" Has Apple Neural Engine: {}", .{device_info.has_anc});
// Test other utility functions
std.log.info("Optimal Work Group Size: {}", .{metal_device.getOptimalWorkGroupSize()});
std.log.info("Memory Strategy: {s}", .{@tagName(metal_device.getMemoryStrategy())});
std.log.info("Optimal Tensor Block Size: {}", .{metal_device.getOptimalTensorBlockSize()});
std.log.info("Test complete!", .{});
}