Mirror of https://github.com/deepseek-ai/DeepSeek-V3.git (synced 2025-07-04)

Merge pull request #2 from Triex:feat-Enhanced-device-detection-handling-initial-metal

Feat-Enhanced-device-detection-handling-initial-metal

Commit 24d94f7c21
.gitignore (vendored)
@ -173,3 +173,4 @@ cython_debug/

# Zig
experimental/.zig-cache/
zig-out/
README.md
@ -16,18 +16,22 @@
|
||||
</div>
|
||||
<hr />
|
||||
|
||||
<h1 align="center"> DeepZig V3: A High-Performance LLM Architecture</h1>
|
||||
# DeepZig V3: A High-Performance LLM Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
A proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
|
||||
A **DRAFT proposal & foundation** for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
|
||||
|
||||
**Status Update**: ✅ **Foundation compiles cleanly as a theoretical implementation** with Zig 0.15.0-dev, including:
|
||||
- HTTP server with modern Zig API
|
||||
- SIMD-optimized tensor operations
|
||||
- Cross-platform backend architecture
|
||||
- Initial memory management
|
||||
- Comprehensive build system draft
|
||||
**⚠️ Status: EXPERIMENTAL DRAFT** ✅ **Foundation compiles with Zig 0.15.0-dev**, including:
|
||||
- ✅ HTTP server framework (basic structure)
|
||||
- ✅ SIMD-optimized tensor operations (draft implementation)
|
||||
- ✅ Cross-platform backend architecture
|
||||
- ✅ Initial memory management
|
||||
- ✅ **Apple Silicon M-series detection** (hardware detection via sysctl)
|
||||
- ✅ Comprehensive build system draft
|
||||
- ⚠️ **NOT PRODUCTION READY** - Draft implementation for research/development
|
||||
|
||||
**Performance Note**: Current naive algorithms are ~1000x slower than optimized BLAS. Matrix multiplication: 6418ms for 1024×1024 (see the naive matmul sketch below). This is expected for a foundational draft implementation. See [experimental benchmarks](experimental/README.md#benchmarks) for detailed performance data.
|
||||
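For a sense of where that gap comes from, here is a minimal sketch of the kind of naive triple-loop matmul the draft currently relies on (illustrative Zig only, not the exact code in `experimental/`):

```zig
const std = @import("std");

// Naive O(n^3) matrix multiply over row-major slices.
fn naiveMatmul(a: []const f32, b: []const f32, c: []f32, n: usize) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f32 = 0.0;
            for (0..n) |k| {
                sum += a[i * n + k] * b[k * n + j]; // strided access to `b` thrashes the cache
            }
            c[i * n + j] = sum;
        }
    }
}

test "naive matmul on a 2x2 example" {
    const a = [_]f32{ 1, 2, 3, 4 };
    const b = [_]f32{ 5, 6, 7, 8 };
    var c = [_]f32{ 0, 0, 0, 0 };
    naiveMatmul(&a, &b, &c, 2);
    try std.testing.expectEqual(@as(f32, 19), c[0]); // 1*5 + 2*7
}
```

Optimized BLAS kernels do the same O(n³) work but add register tiling, cache blocking and SIMD, which accounts for most of the ~1000x gap.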
|
||||
## Why This Matters
|
||||
|
||||
@ -37,6 +41,18 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
- **Complex deployment** with heavy runtimes
|
||||
- **Platform lock-in** due to dependency complexity
|
||||
|
||||
## Expected Benefits vs Current Reality
|
||||
|
||||
| Aspect | Current (PyTorch) | Target (Zig) | **Current Draft** |
|
||||
|--------|------------------|--------------|-------------------|
|
||||
| Cold start | 10-30s | **< 2s** | *Not measured* |
|
||||
| Memory usage | 20-40GB | **< 16GB** | *16GB+ for basic ops* |
|
||||
| Dependencies | ~2GB runtime | **Single binary** | ✅ **Single binary** |
|
||||
| Deployment | Complex | **Copy & run** | ✅ **Copy & run** |
|
||||
| Matrix Mul (1024×1024) | ~1ms (optimized) | **< 1ms** | *6418ms (naive)* |
|
||||
|
||||
*See [experimental benchmarks](experimental/README.md#benchmarks) for current performance measurements.*
|
||||
|
||||
## Why Zig?
|
||||
|
||||
**Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access<br/>
|
||||
@ -56,14 +72,14 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
└─────────────────┘ └──────────────────┘ └─────────────────┘
|
||||
```
|
||||
|
||||
## Proposed Web API
|
||||
## Draft Web API Framework
|
||||
|
||||
### Target Endpoints
|
||||
### Planned Endpoints (Basic Structure Implemented)
|
||||
- `POST /v1/chat/completions` - OpenAI-compatible chat API (see the request sketch after this list)
|
||||
- `POST /v1/completions` - Text completion
|
||||
- `GET /v1/models` - List available models
|
||||
- `GET /health` - Service health check
|
||||
- `WebSocket /ws` - Streaming inference
|
||||
- `WebSocket /ws` - Streaming inference (planned)
|
||||
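For illustration, the chat endpoint is planned to accept OpenAI-style JSON. A minimal Zig sketch of the request types and parsing follows; the field names mirror the public OpenAI schema and are an assumption here, not code from this repository:

```zig
const std = @import("std");

// Hypothetical request shape for `POST /v1/chat/completions`.
const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

const ChatCompletionRequest = struct {
    model: []const u8,
    messages: []const ChatMessage,
    stream: bool = false, // omitted fields fall back to their defaults
};

test "parse a chat completion request body" {
    const body =
        \\{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello"}]}
    ;
    const parsed = try std.json.parseFromSlice(ChatCompletionRequest, std.testing.allocator, body, .{});
    defer parsed.deinit();
    try std.testing.expectEqualStrings("deepseek-v3", parsed.value.model);
    try std.testing.expect(!parsed.value.stream);
}
```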
|
||||
### Deployment Vision
|
||||
- **Static binaries** - Single file deployment, no dependencies
|
||||
@ -72,56 +88,55 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
- **Serverless functions** - Minimal cold start with static linking
|
||||
- **WebAssembly** - Browser inference without additional runtime
|
||||
|
||||
## Implementation Plan
|
||||
## Implementation Plan Status
|
||||
|
||||
### Phase 1: Foundation ✅ **DRAFTED**
|
||||
### Phase 1: Foundation ✅ **DRAFT COMPLETE**
|
||||
- [x] Set up Zig project structure
|
||||
- [x] Implement basic tensor operations with SIMD
|
||||
- [x] Create memory management system (arena allocators)
|
||||
- [x] Build HTTP server framework
|
||||
- [x] **Apple Silicon detection via sysctl calls**
|
||||
- [x] **Updated to Zig 0.15.0-dev - compiles cleanly**
|
||||
- [x] **Benchmark suite** showing current performance
|
||||
|
||||
### Phase 2: Core Model
|
||||
*📈 Performance baseline established - see [benchmarks](experimental/README.md#benchmarks)*
|
||||
|
||||
### Phase 2: Core Model (IN PROGRESS)
|
||||
- [ ] Implement transformer layers
|
||||
- [ ] Add Multi-Head Latent Attention (MLA)
|
||||
- [ ] Build Mixture of Experts (MoE) routing
|
||||
- [ ] Create tokenizer integration
|
||||
|
||||
### Phase 3: Backends
|
||||
### Phase 3: Backends (PLANNED)
|
||||
- [ ] Optimize CPU backend with AVX/NEON
|
||||
- [ ] Integrate Metal for Apple Silicon
|
||||
- [ ] Add CUDA support for NVIDIA GPUs
|
||||
- [ ] Implement WebGPU for browsers
|
||||
|
||||
### Phase 4: Web Integration
|
||||
### Phase 4: Web Integration (DRAFT STRUCTURE)
|
||||
- [x] Complete HTTP API implementation (basic structure)
|
||||
- [ ] Add WebSocket streaming
|
||||
- [ ] Build authentication/rate limiting
|
||||
- [ ] Create deployment tooling
|
||||
|
||||
## Expected Benefits
|
||||
|
||||
| Aspect | Current (PyTorch) | Proposed (Zig) |
|
||||
|--------|------------------|----------------|
|
||||
| Cold start | 10-30s | **< 2s** |
|
||||
| Memory usage | 20-40GB | **< 16GB** |
|
||||
| Dependencies | ~2GB runtime | **Single binary** |
|
||||
| Deployment | Complex | **Copy & run** |
|
||||
|
||||
## Technical Challenges
|
||||
|
||||
- **Model Complexity**: DeepSeek V3's MoE architecture requires careful memory management
|
||||
- **Backend Integration**: Need efficient FFI to CUDA/Metal while maintaining performance
|
||||
- **Web Scale**: Handle concurrent requests without blocking inference
|
||||
- **Accuracy**: Match PyTorch numerical precision
|
||||
- **Performance**: Current implementation is 1000x slower than optimised BLAS - major optimization needed
|
||||
|
||||
## Platform-Specific Opportunities
|
||||
|
||||
### Apple Silicon (M-Series)
|
||||
### Apple Silicon (M-Series) ✅ **Draft Detection Implemented**
|
||||
- **Metal Performance Shaders** integration for matrix operations
|
||||
- **AMX instruction set** access for accelerated linear algebra
|
||||
- **Unified memory architecture** exploitation for zero-copy transfers
|
||||
- **Power efficiency tuning** across P and E cores
|
||||
- **✅ Proper M1/M2/M3/M4 detection** via system calls
|
||||
|
||||
*Current status: Hardware detection working, GPU acceleration not yet implemented.*
|
||||
|
||||
### x86_64 Architecture
|
||||
- **AVX-512 vectorization** with masked operations
|
||||
@ -137,39 +152,29 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
|
||||
|
||||
## Getting Started
|
||||
|
||||
**Current Status**: This repository contains the original Python DeepSeek V3 implementation. The Zig implementation is proposed future work.
|
||||
**Current Status**: This repository contains a **DRAFT EXPERIMENTAL** Zig implementation foundation.
|
||||
|
||||
### For the Current Python Implementation:
|
||||
### For the Current Zig Implementation:
|
||||
```bash
|
||||
# Clone this repository
|
||||
git clone https://github.com/[current-repo-path]
|
||||
cd DeepSeek-V3-Zig
|
||||
git clone https://github.com/Triex/DeepZig-V3
|
||||
cd DeepZig-V3/experimental
|
||||
|
||||
# Follow existing Python setup instructions
|
||||
# (see original DeepSeek V3 documentation)
|
||||
```
|
||||
# Build and test the foundation
|
||||
zig build
|
||||
|
||||
### For the Proposed Zig Implementation:
|
||||
```bash
|
||||
# This would be the future workflow once implemented:
|
||||
# Run the HTTP server (basic structure)
|
||||
zig build run -- --port 8080
|
||||
|
||||
# 1. Set up new Zig project structure
|
||||
zig init-exe deepseek-v3-zig
|
||||
|
||||
# 2. Implement core components
|
||||
# - Tensor operations with SIMD
|
||||
# - HTTP server framework
|
||||
# - Model architecture
|
||||
|
||||
# 3. Test and benchmark
|
||||
zig build test
|
||||
# Run benchmarks (see actual performance)
|
||||
zig build bench
|
||||
|
||||
# 4. Run web server
|
||||
zig build run -- --port 8080
|
||||
# Test Apple Silicon detection
|
||||
zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
|
||||
./test_m_series
|
||||
```
|
||||
|
||||
**Want to contribute to making this real?** See [Seeking Contributors](#seeking-contributors) below.
|
||||
**📊 Performance Reality Check**: See [experimental/README.md](experimental/README.md) for actual benchmark results showing current performance limitations and optimisation opportunities.
|
||||
|
||||
## Development Approach
|
||||
|
||||
@ -183,38 +188,27 @@ Reference: [Zig Cookbook](https://zigcc.github.io/zig-cookbook/) for implementat
|
||||
|
||||
## Seeking Contributors
|
||||
|
||||
This is an ambitious project that would benefit from expertise in:
|
||||
This is an ambitious **DRAFT project** that would benefit from expertise in:
|
||||
- **Performance optimization** (current bottleneck: naive matrix operations)
|
||||
- **Zig systems programming**
|
||||
- **GPU kernel optimization** (CUDA/Metal)
|
||||
- **ML model implementation**
|
||||
- **Web server development**
|
||||
- **Performance optimization**
|
||||
- **Hardware-software co-design**
|
||||
- **Novel inference techniques** (Speculative decoding, quantization)
|
||||
|
||||
## Project Timeline
|
||||
## Current Limitations & Next Steps
|
||||
|
||||
- Foundation and basic tensor ops
|
||||
- Core transformer implementation
|
||||
- Backend optimization and web API
|
||||
- Testing, benchmarking, deployment tools
|
||||
**🚧 What's Working**: Compiles, runs, measures performance
|
||||
**⚠️ What's Missing**: Optimized algorithms, robust flows, actual DeepSeek V3 model
|
||||
**📊 Performance Gap**: 1000x slower than production systems
|
||||
**🎯 Next Priority**: BLAS integration and GPU acceleration
|
||||
|
||||
## Key Questions
|
||||
|
||||
**Q: Why not just optimize PyTorch?**
|
||||
A: PyTorch's Python overhead and GC pauses are fundamental limitations. Zig offers zero-cost abstractions, superior error handling, and deterministic performance.
|
||||
|
||||
**Q: How will this compare to llama.cpp?**
|
||||
A: Similar performance goals, but with built-in web API, better memory management, and focus on DeepSeek V3's specific MoE architecture.
|
||||
|
||||
**Q: What about ONNX/TensorRT/ZML etc?**
|
||||
A: Those are inference runtimes, not development frameworks / LLM frameworks. This project enables rapid iteration and custom optimization for research.
|
||||
|
||||
---
|
||||
See [experimental implementation](experimental/) for technical details and current benchmarks.
|
||||
|
||||
## References
|
||||
|
||||
- [DeepZig V3 (Experimental Start)](https://github.com/Triex/DeepZig-V3/tree/main/experimental)
|
||||
- [DeepZig V3 (Experimental Implementation)](experimental/) - **Current working code**
|
||||
- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437) - Original model architecture
|
||||
- [Zig Language](https://ziglang.org/) - Language documentation
|
||||
- [Awesome Zig](https://github.com/C-BJ/awesome-zig) - Community resources
|
||||
@ -225,5 +219,7 @@ A: Those are inference runtimes, not development frameworks / LLM frameworks. Th
|
||||
|
||||
---
|
||||
|
||||
**Status**: 🎯 Seeking feedback & idea expansion<br/>
|
||||
**Status**: 🎯 **EXPERIMENTAL DRAFT** - Foundation compiles and runs basic operations ([see benchmarks](experimental/README.md#benchmarks))<br/>
|
||||
**Vision**: Foundation for advanced AI reasoning research
|
||||
|
||||
**⚠️ Important**: This is a **research/development foundation** with draft/base implementations. Not ready for production use.
|
||||
|
@ -9,6 +9,7 @@ A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/)
|
||||
> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON; see the sketch after this list)
|
||||
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
|
||||
> - ✅ **Memory management** and backend architecture
|
||||
> - ✅ **Apple Silicon detection via sysctl calls**
|
||||
>
|
||||
> **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.<br/>
|
||||
> **Performance Note**: Current implementation uses naive algorithms - matrix multiplication is ~1000x slower than optimized BLAS. See [benchmarks](#benchmarks) below.<br/>
|
||||
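As a rough illustration of what the SIMD path does, here is a hand-written sketch using Zig's portable `@Vector` (not the tensor code from this commit); the compiler lowers it to AVX2 on x86_64 and NEON on aarch64:

```zig
const std = @import("std");

// Element-wise addition processed eight f32 lanes at a time, with a scalar tail.
fn addSlices(a: []const f32, b: []const f32, out: []f32) void {
    const lanes = 8; // one AVX2 register of f32; NEON would typically use 4
    var i: usize = 0;
    while (i + lanes <= a.len) : (i += lanes) {
        const va: @Vector(lanes, f32) = a[i..][0..lanes].*;
        const vb: @Vector(lanes, f32) = b[i..][0..lanes].*;
        out[i..][0..lanes].* = va + vb;
    }
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i]; // scalar tail
}

test "vectorized add matches scalar result" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    var out: [9]f32 = undefined;
    addSlices(&a, &b, &out);
    for (out) |v| try std.testing.expectEqual(@as(f32, 10), v);
}
```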
@ -25,6 +26,8 @@ This experimental implementation aims to leverage Zig's unique advantages for sy
|
||||
- **Single binary deployment** with no runtime dependencies
|
||||
- **Cross-platform compilation** for multiple architectures
|
||||
|
||||
**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.
|
||||
|
||||
## Project Structure
|
||||
|
||||
```
|
||||
@ -230,20 +233,26 @@ Run benchmarks to measure performance:
|
||||
zig build bench
|
||||
```
|
||||
|
||||
**Hardware Context**: Benchmarks run on Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.
|
||||
|
||||
Example output:
|
||||
```
|
||||
🚀 DeepZig V3 Performance Benchmarks
|
||||
==========================================
|
||||
|
||||
Backend: CPU (SIMD optimized)
|
||||
Architecture: x86_64
|
||||
Thread count: 16
|
||||
Architecture: aarch64
|
||||
Thread count: 8
|
||||
Hardware: Apple M1 MacBook Pro, 16GB unified memory
|
||||
|
||||
Operation | Iterations | Avg Time | Operations/s | Memory
|
||||
-------------------------------|------------|-----------|--------------|-------
|
||||
Tensor Creation (1024x1024) | 1000 iter | 2.03 ms | 493 ops/s | 4.0 MB
|
||||
Tensor Addition (SIMD) | 100 iter | 1.49 ms | 2806962690 ops/s | 48.0 MB
|
||||
Matrix Multiplication | 10 iter | 6418.08 ms | 0 GFLOPS | 12.0 MB
|
||||
SwiGLU Activation | 1000 iter | 4.44 ms | 236002478 ops/s | 12.0 MB
|
||||
RMS Normalization (SIMD) | 1000 iter | 0.00 ms | 1077586 ops/s | 0.0 MB
|
||||
Memory Bandwidth | 100 iter | 4.92 ms | 13 ops/s | 128.0 MB
|
||||
```
|
||||
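For scale: a 1024×1024 matrix multiply is roughly 2·1024³ ≈ 2.1 GFLOPs of work, so 6418 ms works out to about 0.33 GFLOPS, which the whole-number GFLOPS column above reports as 0.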
|
||||
## Known Issues
|
||||
|
experimental/src/backends/metal/device.zig (new file)
@ -0,0 +1,306 @@
|
||||
// Metal Device detection and handling for Apple Silicon
|
||||
// Specifically optimized for M-series chips using proper system detection
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const c = std.c;
|
||||
|
||||
// Device information structure
|
||||
pub const MetalDeviceInfo = struct {
|
||||
device_name: []const u8,
|
||||
is_apple_silicon: bool,
|
||||
is_m_series: bool,
|
||||
series_generation: u8, // 1 = M1, 2 = M2, 3 = M3, etc.
|
||||
variant: []const u8, // "Pro", "Max", "Ultra", etc.
|
||||
unified_memory_size: u64,
|
||||
has_anc: bool, // Apple Neural Engine
|
||||
|
||||
pub fn format(
|
||||
self: @This(),
|
||||
comptime fmt: []const u8,
|
||||
options: std.fmt.FormatOptions,
|
||||
writer: anytype,
|
||||
) !void {
|
||||
_ = fmt;
|
||||
_ = options;
|
||||
try writer.print("Metal Device: {s} ({s}{d} {s})", .{
|
||||
self.device_name,
|
||||
if (self.is_m_series) "M" else "",
|
||||
if (self.is_m_series) self.series_generation else 0,
|
||||
if (self.is_m_series) self.variant else "",
|
||||
});
|
||||
try writer.print("\nUnified Memory: {} GB", .{self.unified_memory_size / (1024 * 1024 * 1024)});
|
||||
try writer.print("\nApple Neural Engine: {}", .{if (self.has_anc) "Available" else "Not Available"});
|
||||
}
|
||||
};
|
||||
|
||||
// M-series chip information
|
||||
const MSeriesInfo = struct {
|
||||
is_m_series: bool,
|
||||
generation: u8,
|
||||
variant: []const u8,
|
||||
};
|
||||
|
||||
// System detection using sysctl
|
||||
const SysctlError = error{
|
||||
NotFound,
|
||||
BufferTooSmall,
|
||||
SystemError,
|
||||
};
|
||||
|
||||
/// Get sysctl string value
|
||||
fn getSysctlString(allocator: Allocator, name: []const u8) ![]const u8 {
|
||||
// Only available on macOS
|
||||
if (@import("builtin").os.tag != .macos) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
var size: usize = 0;
|
||||
|
||||
// First, get the size needed
|
||||
const name_cstr = try allocator.dupeZ(u8, name);
|
||||
defer allocator.free(name_cstr);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, null, &size, null, 0) != 0) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
// Allocate buffer and get the actual value
|
||||
const buffer = try allocator.alloc(u8, size);
|
||||
defer allocator.free(buffer);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, buffer.ptr, &size, null, 0) != 0) {
|
||||
return SysctlError.SystemError;
|
||||
}
|
||||
|
||||
// Return a copy of the string (minus null terminator if present)
|
||||
const len = if (size > 0 and buffer[size - 1] == 0) size - 1 else size;
|
||||
return try allocator.dupe(u8, buffer[0..len]);
|
||||
}
|
||||
|
||||
/// Get sysctl integer value
|
||||
fn getSysctlInt(comptime T: type, name: []const u8, allocator: Allocator) !T {
|
||||
if (@import("builtin").os.tag != .macos) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
var value: T = 0;
|
||||
var size: usize = @sizeOf(T);
|
||||
|
||||
const name_cstr = try allocator.dupeZ(u8, name);
|
||||
defer allocator.free(name_cstr);
|
||||
|
||||
if (c.sysctlbyname(name_cstr.ptr, &value, &size, null, 0) != 0) {
|
||||
return SysctlError.NotFound;
|
||||
}
|
||||
|
||||
return value;
|
||||
}
|
||||
|
||||
/// Check if running under Rosetta 2 translation
|
||||
fn isRunningUnderRosetta(allocator: Allocator) bool {
|
||||
const result = getSysctlInt(i32, "sysctl.proc_translated", allocator) catch return false;
|
||||
return result == 1;
|
||||
}
|
||||
|
||||
/// Check if hardware supports ARM64 (Apple Silicon)
|
||||
fn isAppleSiliconHardware(allocator: Allocator) bool {
|
||||
// Check for ARM64 support
|
||||
const arm64_support = getSysctlInt(i32, "hw.optional.arm64", allocator) catch return false;
|
||||
if (arm64_support == 1) return true;
|
||||
|
||||
// Alternative check: CPU architecture
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) return true;
|
||||
|
||||
// If running under Rosetta, we're on Apple Silicon
|
||||
return isRunningUnderRosetta(allocator);
|
||||
}
|
||||
|
||||
/// Parse M-series information from CPU brand string
|
||||
fn parseMSeriesInfo(cpu_brand: []const u8) MSeriesInfo {
|
||||
// Default values
|
||||
var result = MSeriesInfo{
|
||||
.is_m_series = false,
|
||||
.generation = 0,
|
||||
.variant = "",
|
||||
};
|
||||
|
||||
// Look for Apple M pattern
|
||||
if (std.mem.indexOf(u8, cpu_brand, "Apple M") == null) {
|
||||
return result;
|
||||
}
|
||||
|
||||
result.is_m_series = true;
|
||||
|
||||
// Extract generation and variant from CPU brand string
|
||||
// Examples: "Apple M1", "Apple M1 Pro", "Apple M1 Max", "Apple M1 Ultra"
|
||||
if (std.mem.indexOf(u8, cpu_brand, "M1")) |_| {
|
||||
result.generation = 1;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
// Just "Apple M1" - this is the regular M1
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M2")) |_| {
|
||||
result.generation = 2;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M3")) |_| {
|
||||
result.generation = 3;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, "M4")) |_| {
|
||||
result.generation = 4;
|
||||
if (std.mem.indexOf(u8, cpu_brand, " Pro")) |_| {
|
||||
result.variant = "Pro";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Max")) |_| {
|
||||
result.variant = "Max";
|
||||
} else if (std.mem.indexOf(u8, cpu_brand, " Ultra")) |_| {
|
||||
result.variant = "Ultra";
|
||||
} else {
|
||||
result.variant = "";
|
||||
}
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
/// Try to detect GPU configuration for more detailed chip identification
|
||||
fn detectGPUCores(allocator: Allocator) u32 {
|
||||
// Try to get GPU core count - this can help distinguish variants
|
||||
// Regular M1: 7-8 GPU cores
|
||||
// M1 Pro: 14-16 GPU cores
|
||||
// M1 Max: 24-32 GPU cores
|
||||
|
||||
// This is a placeholder - actual implementation would query Metal API
|
||||
// For now, return 0 to indicate unknown
|
||||
_ = allocator;
|
||||
return 0;
|
||||
}
|
||||
|
||||
/// Detect Apple Silicon and M-series chip capabilities using proper system detection
|
||||
pub fn detectAppleSilicon(allocator: Allocator) !MetalDeviceInfo {
|
||||
// Check at compile-time if we're on macOS
|
||||
const is_macos = @import("builtin").os.tag == .macos;
|
||||
if (!is_macos) {
|
||||
return MetalDeviceInfo{
|
||||
.device_name = try allocator.dupe(u8, "Non-macOS Device"),
|
||||
.is_apple_silicon = false,
|
||||
.is_m_series = false,
|
||||
.series_generation = 0,
|
||||
.variant = try allocator.dupe(u8, ""),
|
||||
.unified_memory_size = 0,
|
||||
.has_anc = false,
|
||||
};
|
||||
}
|
||||
|
||||
// Detect Apple Silicon hardware
|
||||
const is_apple_silicon = isAppleSiliconHardware(allocator);
|
||||
if (!is_apple_silicon) {
|
||||
return MetalDeviceInfo{
|
||||
.device_name = try allocator.dupe(u8, "Intel Mac"),
|
||||
.is_apple_silicon = false,
|
||||
.is_m_series = false,
|
||||
.series_generation = 0,
|
||||
.variant = try allocator.dupe(u8, ""),
|
||||
.unified_memory_size = 0,
|
||||
.has_anc = false,
|
||||
};
|
||||
}
|
||||
|
||||
// Get CPU brand string for M-series detection - this is the authoritative source
|
||||
const cpu_brand = getSysctlString(allocator, "machdep.cpu.brand_string") catch try allocator.dupe(u8, "Apple Silicon"); // dupe the fallback so the deferred free below is always valid
|
||||
defer allocator.free(cpu_brand);
|
||||
|
||||
std.log.debug("CPU Brand String: '{s}'", .{cpu_brand});
|
||||
|
||||
// Parse M-series information from the actual CPU brand string
|
||||
const m_info = parseMSeriesInfo(cpu_brand);
|
||||
|
||||
// Get additional hardware details for logging/debugging
|
||||
const hw_model = getSysctlString(allocator, "hw.model") catch "";
|
||||
defer if (hw_model.len > 0) allocator.free(hw_model);
|
||||
|
||||
const gpu_cores = detectGPUCores(allocator);
|
||||
if (gpu_cores > 0) {
|
||||
std.log.debug("GPU Cores: {}", .{gpu_cores});
|
||||
}
|
||||
|
||||
std.log.debug("Hardware Model: '{s}'", .{hw_model});
|
||||
std.log.debug("Detected M{d} {s}", .{ m_info.generation, m_info.variant });
|
||||
|
||||
// Get system memory
|
||||
const memory_size = getSysctlInt(u64, "hw.memsize", allocator) catch (16 * 1024 * 1024 * 1024); // Default 16GB
|
||||
|
||||
// Get device name
|
||||
const device_name = getSysctlString(allocator, "hw.model") catch try allocator.dupe(u8, "Apple Silicon Mac"); // dupe the fallback so the caller can free it unconditionally
|
||||
|
||||
return MetalDeviceInfo{
|
||||
.device_name = device_name, // This will be owned by the caller
|
||||
.is_apple_silicon = true,
|
||||
.is_m_series = m_info.is_m_series,
|
||||
.series_generation = m_info.generation,
|
||||
.variant = try allocator.dupe(u8, m_info.variant), // Duplicate to ensure consistent allocation
|
||||
.unified_memory_size = memory_size,
|
||||
.has_anc = m_info.is_m_series, // All M-series have Apple Neural Engine
|
||||
};
|
||||
}
|
||||
|
||||
/// Get optimal GPU parameters for detected device
|
||||
pub fn getOptimalWorkGroupSize() u32 {
|
||||
// These are reasonable defaults that should work well on most Apple GPU architectures
|
||||
// In a real implementation, we would query Metal API for the actual optimal values
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) {
|
||||
// Apple Silicon optimized values based on GPU core count
|
||||
return 128;
|
||||
}
|
||||
|
||||
// Default for Intel Macs and others
|
||||
return 64;
|
||||
}
|
||||
|
||||
/// Get recommended memory allocation strategy based on device capabilities
|
||||
pub fn getMemoryStrategy() enum { UnifiedMemory, DiscreteMemory } {
|
||||
// Check if we're on Apple Silicon hardware (even under Rosetta)
|
||||
if (@import("builtin").os.tag == .macos) {
|
||||
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
||||
defer _ = gpa.deinit();
|
||||
const allocator = gpa.allocator();
|
||||
|
||||
if (isAppleSiliconHardware(allocator)) {
|
||||
return .UnifiedMemory; // Apple Silicon uses unified memory
|
||||
}
|
||||
}
|
||||
|
||||
// For Intel Macs and other platforms
|
||||
return .DiscreteMemory;
|
||||
}
|
||||
|
||||
/// Get optimal tensor block size for current device
|
||||
pub fn getOptimalTensorBlockSize() u32 {
|
||||
if (@import("builtin").target.cpu.arch == .aarch64) {
|
||||
// Apple Silicon has more GPU cores and benefits from larger blocks
|
||||
return 256;
|
||||
} else {
|
||||
return 128;
|
||||
}
|
||||
}
|
experimental/src/backends/metal/memory.zig (new file)
@ -0,0 +1,152 @@
|
||||
// Metal-specific memory management for Apple Silicon
|
||||
// Optimized for the unified memory architecture of M-series chips
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const device = @import("device.zig");
|
||||
const MetalDeviceInfo = device.MetalDeviceInfo;
|
||||
|
||||
/// Memory modes available for Metal buffers
|
||||
pub const MetalMemoryMode = enum {
|
||||
/// Shared between CPU and GPU with automatic migration
|
||||
Shared,
|
||||
|
||||
/// Managed with separate CPU and GPU views but synchronized
|
||||
Managed,
|
||||
|
||||
/// GPU-only storage for maximum performance
|
||||
Private,
|
||||
|
||||
/// Memory visible to both CPU and GPU (Apple Silicon only)
|
||||
Unified,
|
||||
};
|
||||
|
||||
/// Buffer usage patterns to optimize memory allocation
|
||||
pub const MetalBufferUsage = enum {
|
||||
/// Read often by GPU
|
||||
GpuRead,
|
||||
|
||||
/// Write often by GPU
|
||||
GpuWrite,
|
||||
|
||||
/// Read/write by both CPU and GPU
|
||||
Shared,
|
||||
|
||||
/// Used only temporarily for a single operation
|
||||
Transient,
|
||||
};
|
||||
|
||||
/// Memory manager for optimal Metal buffer allocation on M-series chips
|
||||
pub const MetalMemoryManager = struct {
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
total_allocated: usize,
|
||||
max_allocation: usize,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
/// Create a new Metal memory manager
|
||||
pub fn init(allocator: Allocator, device_info: ?MetalDeviceInfo) Self {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_info = device_info,
|
||||
.total_allocated = 0,
|
||||
.max_allocation = 0,
|
||||
};
|
||||
}
|
||||
|
||||
/// Clean up any resources
|
||||
pub fn deinit(self: *Self) void {
|
||||
// Release any cached buffers or other resources
|
||||
_ = self;
|
||||
}
|
||||
|
||||
/// Get the optimal memory mode based on device capabilities and usage pattern
|
||||
pub fn getOptimalMemoryMode(self: *Self, usage: MetalBufferUsage) MetalMemoryMode {
|
||||
// If we're on Apple Silicon, we can use unified memory
|
||||
const is_apple_silicon = self.device_info != null and self.device_info.?.is_apple_silicon;
|
||||
|
||||
if (is_apple_silicon) {
|
||||
return switch (usage) {
|
||||
.GpuRead => .Unified,
|
||||
.GpuWrite => .Unified,
|
||||
.Shared => .Unified,
|
||||
.Transient => .Private, // Even on unified memory, transient data is better in private
|
||||
};
|
||||
} else {
|
||||
// On Intel Macs with discrete GPU
|
||||
return switch (usage) {
|
||||
.GpuRead => .Managed,
|
||||
.GpuWrite => .Private,
|
||||
.Shared => .Managed,
|
||||
.Transient => .Private,
|
||||
};
|
||||
}
|
||||
}
|
||||
|
||||
/// Get recommended allocation size (aligned to device preferences)
|
||||
pub fn getOptimalAllocationSize(self: *Self, requested_size: usize) usize {
|
||||
// M-series chips prefer certain memory alignment patterns
|
||||
const alignment: usize = if (self.device_info != null and self.device_info.?.is_m_series)
|
||||
16 * 1024 // 16KB alignment on M-series
|
||||
else
|
||||
4 * 1024; // 4KB on other devices
|
||||
|
||||
return std.mem.alignForward(usize, requested_size, alignment);
|
||||
}
|
||||
|
||||
/// Track memory allocations for monitoring
|
||||
pub fn trackAllocation(self: *Self, size: usize) void {
|
||||
self.total_allocated += size;
|
||||
self.max_allocation = @max(self.max_allocation, self.total_allocated);
|
||||
}
|
||||
|
||||
/// Track memory deallocations
|
||||
pub fn trackDeallocation(self: *Self, size: usize) void {
|
||||
if (self.total_allocated >= size) {
|
||||
self.total_allocated -= size;
|
||||
} else {
|
||||
self.total_allocated = 0;
|
||||
}
|
||||
}
|
||||
|
||||
/// Get memory usage statistics
|
||||
pub fn getMemoryStats(self: *Self) struct {
|
||||
current: usize,
|
||||
peak: usize,
|
||||
device_total: usize,
|
||||
} {
|
||||
const device_total = if (self.device_info != null)
|
||||
self.device_info.?.unified_memory_size
|
||||
else
|
||||
0;
|
||||
|
||||
return .{
|
||||
.current = self.total_allocated,
|
||||
.peak = self.max_allocation,
|
||||
.device_total = device_total,
|
||||
};
|
||||
}
|
||||
|
||||
/// Get recommended buffer storage mode string for Metal API
|
||||
pub fn getStorageModeString(mode: MetalMemoryMode) []const u8 {
|
||||
return switch (mode) {
|
||||
.Shared => "MTLStorageModeShared",
|
||||
.Managed => "MTLStorageModeManaged",
|
||||
.Private => "MTLStorageModePrivate",
|
||||
.Unified => "MTLStorageModeShared", // Unified uses Shared on the API level
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Helper to determine if hazard tracking should be enabled based on device capabilities
|
||||
pub fn shouldUseHazardTracking(device_info: ?MetalDeviceInfo) bool {
|
||||
if (device_info == null) return false;
|
||||
|
||||
// M3 and newer have better hazard tracking hardware
|
||||
if (device_info.?.is_m_series and device_info.?.series_generation >= 3) {
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
|
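For orientation, a hypothetical caller of the memory manager above might look like the following (not part of this commit; the `@import` path assumes a file alongside `memory.zig`):

```zig
const std = @import("std");
const metal_memory = @import("memory.zig");

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // Passing null device info exercises the conservative non-Apple-Silicon paths.
    var manager = metal_memory.MetalMemoryManager.init(gpa.allocator(), null);
    defer manager.deinit();

    const mode = manager.getOptimalMemoryMode(.GpuRead); // .Managed when no device info is available
    const size = manager.getOptimalAllocationSize(1000); // rounded up to the 4 KB fallback alignment
    std.debug.print("mode={s} size={d}\n", .{ @tagName(mode), size });
}
```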
@ -4,12 +4,18 @@
|
||||
const std = @import("std");
|
||||
const deepseek_core = @import("deepseek_core");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const metal_device = @import("device.zig");
|
||||
const MetalDeviceInfo = metal_device.MetalDeviceInfo;
|
||||
|
||||
/// Metal backend implementation for Apple Silicon
|
||||
pub const MetalBackend = struct {
|
||||
allocator: Allocator,
|
||||
device_available: bool,
|
||||
unified_memory_size: u64,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
optimal_work_group_size: u32,
|
||||
memory_strategy: @TypeOf(metal_device.getMemoryStrategy()), // field type taken from the helper's return type
|
||||
tensor_block_size: u32,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
@ -17,10 +23,37 @@ pub const MetalBackend = struct {
|
||||
// Check if Metal is available (compile-time check for macOS)
|
||||
const metal_available = @import("builtin").os.tag == .macos;
|
||||
|
||||
var device_info: ?MetalDeviceInfo = null;
|
||||
var unified_memory_size: u64 = 0;
|
||||
var optimal_work_group_size: u32 = 64; // Default
|
||||
var tensor_block_size: u32 = 128; // Default
|
||||
|
||||
if (metal_available) {
|
||||
std.log.info("Metal Backend initialized on Apple Silicon");
|
||||
// TODO: Initialize MTLDevice and command queue
|
||||
// TODO: Query unified memory size
|
||||
// Detect Apple Silicon and M-series capabilities
|
||||
device_info = try metal_device.detectAppleSilicon(allocator);
|
||||
unified_memory_size = device_info.?.unified_memory_size;
|
||||
optimal_work_group_size = metal_device.getOptimalWorkGroupSize();
|
||||
tensor_block_size = metal_device.getOptimalTensorBlockSize();
|
||||
|
||||
std.log.info("Metal Backend initialized on {s}", .{device_info.?.device_name});
|
||||
// Log detailed device information
|
||||
if (device_info.?.is_apple_silicon) {
|
||||
if (device_info.?.is_m_series) {
|
||||
std.log.info("Detected M{d} {s} with {d}GB unified memory",
|
||||
.{
|
||||
device_info.?.series_generation,
|
||||
device_info.?.variant,
|
||||
unified_memory_size / (1024 * 1024 * 1024),
|
||||
}
|
||||
);
|
||||
} else {
|
||||
std.log.info("Detected Apple Silicon (non-M series) with {d}GB unified memory",
|
||||
.{unified_memory_size / (1024 * 1024 * 1024)}
|
||||
);
|
||||
}
|
||||
} else {
|
||||
std.log.warn("Metal is available but not running on Apple Silicon");
|
||||
}
|
||||
} else {
|
||||
std.log.warn("Metal Backend not available on this platform");
|
||||
}
|
||||
@ -28,7 +61,11 @@ pub const MetalBackend = struct {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_available = metal_available,
|
||||
.unified_memory_size = if (metal_available) 16 * 1024 * 1024 * 1024 else 0, // 16GB default
|
||||
.unified_memory_size = unified_memory_size,
|
||||
.device_info = device_info,
|
||||
.optimal_work_group_size = optimal_work_group_size,
|
||||
.memory_strategy = metal_device.getMemoryStrategy(),
|
||||
.tensor_block_size = tensor_block_size,
|
||||
};
|
||||
}
|
||||
|
||||
@ -54,14 +91,93 @@ pub const MetalBackend = struct {
|
||||
c.shape.dims[0], c.shape.dims[1]
|
||||
});
|
||||
|
||||
// Check if we're on Apple Silicon M series for optimized path
|
||||
if (self.device_info != null and self.device_info.?.is_m_series) {
|
||||
std.log.debug("Using optimized M{d} {s} matrix multiplication",
|
||||
.{
|
||||
self.device_info.?.series_generation,
|
||||
self.device_info.?.variant
|
||||
}
|
||||
);
|
||||
|
||||
// Select appropriate implementation based on M series generation
|
||||
switch (self.device_info.?.series_generation) {
|
||||
3 => return try self.matmulM3(a, b, c), // M3 optimized path
|
||||
2 => return try self.matmulM2(a, b, c), // M2 optimized path
|
||||
1 => return try self.matmulM1(a, b, c), // M1 optimized path
|
||||
else => {} // Fall through to generic implementation
|
||||
}
|
||||
}
|
||||
|
||||
// TODO: Implement actual Metal compute shader
|
||||
// This would involve:
|
||||
// 1. Create MTLBuffer from tensor data
|
||||
// 2. Set up compute pipeline with matmul shader
|
||||
// 3. Dispatch compute commands
|
||||
// 3. Dispatch compute commands with optimized workgroup size based on device
|
||||
// 4. Copy results back to tensor
|
||||
|
||||
// For now, fallback to CPU implementation
|
||||
std.log.warn("Falling back to CPU implementation, Metal not implemented");
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M1-optimized matrix multiplication
|
||||
fn matmulM1(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M1-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M1-specific parameters
|
||||
// - Optimize for 7/8 GPU cores typically found in M1
|
||||
// - Account for unified memory bandwidth on M1
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M2-optimized matrix multiplication
|
||||
fn matmulM2(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M2-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M2-specific parameters
|
||||
// - Optimize for 8/10 GPU cores typically found in M2
|
||||
// - Account for increased memory bandwidth on M2
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
/// M3-optimized matrix multiplication
|
||||
fn matmulM3(
|
||||
self: *Self,
|
||||
a: *deepseek_core.Tensor,
|
||||
b: *const deepseek_core.Tensor,
|
||||
c: *deepseek_core.Tensor,
|
||||
) !void {
|
||||
_ = self;
|
||||
_ = a;
|
||||
_ = b;
|
||||
_ = c;
|
||||
|
||||
// TODO: M3-specific optimizations
|
||||
// - Use MPSMatrixMultiplication with M3-specific parameters
|
||||
// - Optimize for 10/16 GPU cores typically found in M3
|
||||
// - Account for dynamic core switching on M3
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
@ -77,16 +193,59 @@ pub const MetalBackend = struct {
|
||||
return error.MetalNotAvailable;
|
||||
}
|
||||
|
||||
_ = input;
|
||||
std.log.debug("Metal RMS normalization with {} elements", .{input.len});
|
||||
|
||||
// Check if we're on Apple Silicon M series for optimized path
|
||||
if (self.device_info != null and self.device_info.?.is_m_series) {
|
||||
std.log.debug("Using optimized M{d} {s} RMS normalization",
|
||||
.{
|
||||
self.device_info.?.series_generation,
|
||||
self.device_info.?.variant
|
||||
}
|
||||
);
|
||||
|
||||
// Select optimal workgroup size based on M series generation
|
||||
const workgroup_size = switch (self.device_info.?.series_generation) {
|
||||
3 => 256, // M3 has more GPU cores
|
||||
2 => 192, // M2 optimization
|
||||
else => 128, // M1 and others
|
||||
};
|
||||
|
||||
// Determine if we should use unified memory approach
|
||||
const use_unified_memory = self.memory_strategy == .UnifiedMemory;
|
||||
|
||||
// Calculate optimal thread count based on input size and GPU cores
|
||||
const thread_count = @min(
std.mem.alignForward(usize, input.len, workgroup_size),
|
||||
workgroup_size * 1024 // Maximum reasonable thread count
|
||||
);
|
||||
|
||||
std.log.debug("RMS Norm using workgroup size: {}, threads: {}",
|
||||
.{workgroup_size, thread_count});
|
||||
|
||||
// TODO: Implement Metal compute shader for RMS norm with M-series optimizations
|
||||
// 1. Create buffers (potentially using managed storage mode for unified memory)
|
||||
// 2. Set up compute pipeline with RMS norm shader
|
||||
// 3. Dispatch compute with optimal work group size
|
||||
// 4. Handle results with zero-copy when possible on unified memory
|
||||
|
||||
if (!use_unified_memory) {
|
||||
// Would handle non-unified memory path differently
|
||||
std.log.debug("Using discrete memory path");
|
||||
}
|
||||
|
||||
// thread_count is used in the log message above, don't discard it
|
||||
}
|
||||
|
||||
// TODO: Complete implementation of Metal compute shader for RMS norm
|
||||
// Metal excels at parallel operations like normalization
|
||||
|
||||
// Don't discard input since it's used above for thread_count calculation
|
||||
// Only discard these if not used above
|
||||
_ = weight;
|
||||
_ = output;
|
||||
_ = eps;
|
||||
|
||||
std.log.debug("Metal RMS normalization");
|
||||
|
||||
// TODO: Implement Metal compute shader for RMS norm
|
||||
// Metal excels at parallel operations like normalization
|
||||
|
||||
return error.NotImplemented;
|
||||
}
|
||||
|
||||
|
experimental/src/backends/metal/shader.zig (new file)
@ -0,0 +1,254 @@
|
||||
// Metal shader utility for managing and optimizing Metal shaders
|
||||
// With specific optimizations for M-series Apple Silicon
|
||||
|
||||
const std = @import("std");
|
||||
const Allocator = std.mem.Allocator;
|
||||
const device = @import("device.zig");
|
||||
const MetalDeviceInfo = device.MetalDeviceInfo;
|
||||
|
||||
/// Optimization level for Metal shaders
|
||||
pub const ShaderOptimizationLevel = enum {
|
||||
none,
|
||||
default,
|
||||
performance,
|
||||
size,
|
||||
|
||||
/// Get the recommended optimization level based on device capabilities
|
||||
pub fn fromDeviceInfo(device_info: ?MetalDeviceInfo) ShaderOptimizationLevel {
|
||||
if (device_info == null) return .default;
|
||||
|
||||
if (device_info.?.is_m_series) {
|
||||
// M3 can handle highly optimized shaders
|
||||
if (device_info.?.series_generation >= 3) {
|
||||
return .performance;
|
||||
}
|
||||
// M1/M2 balance between performance and size
|
||||
else {
|
||||
return .default;
|
||||
}
|
||||
}
|
||||
|
||||
// For non-Apple Silicon, be more conservative
|
||||
return .default;
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader types
|
||||
pub const ShaderType = enum {
|
||||
compute,
|
||||
vertex,
|
||||
fragment,
|
||||
|
||||
pub fn toMTLFunctionType(self: ShaderType) []const u8 {
|
||||
return switch (self) {
|
||||
.compute => "MTLFunctionTypeKernel",
|
||||
.vertex => "MTLFunctionTypeVertex",
|
||||
.fragment => "MTLFunctionTypeFragment",
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader source with metadata
|
||||
pub const ShaderSource = struct {
|
||||
name: []const u8,
|
||||
source_code: []const u8,
|
||||
shader_type: ShaderType,
|
||||
|
||||
/// Create a shader source with a given name and code
|
||||
pub fn init(name: []const u8, source_code: []const u8, shader_type: ShaderType) ShaderSource {
|
||||
return .{
|
||||
.name = name,
|
||||
.source_code = source_code,
|
||||
.shader_type = shader_type,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Metal shader compilation options including M-series specific optimizations
|
||||
pub const ShaderCompileOptions = struct {
|
||||
optimization_level: ShaderOptimizationLevel,
|
||||
fast_math: bool,
|
||||
preserve_invariance: bool,
|
||||
|
||||
/// Create default options for a specific device
|
||||
pub fn forDevice(device_info: ?MetalDeviceInfo) ShaderCompileOptions {
|
||||
const opt_level = ShaderOptimizationLevel.fromDeviceInfo(device_info);
|
||||
|
||||
// M-series chips benefit from fast math but some algorithms require precision
|
||||
const fast_math = device_info != null and
|
||||
device_info.?.is_m_series and
|
||||
device_info.?.series_generation >= 2;
|
||||
|
||||
return .{
|
||||
.optimization_level = opt_level,
|
||||
.fast_math = fast_math,
|
||||
.preserve_invariance = false,
|
||||
};
|
||||
}
|
||||
};
|
||||
|
||||
/// Utility for managing Metal shader compilation and caching
|
||||
pub const ShaderManager = struct {
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo,
|
||||
compile_options: ShaderCompileOptions,
|
||||
|
||||
const Self = @This();
|
||||
|
||||
/// Create a new shader manager
|
||||
pub fn init(
|
||||
allocator: Allocator,
|
||||
device_info: ?MetalDeviceInfo
|
||||
) Self {
|
||||
return Self{
|
||||
.allocator = allocator,
|
||||
.device_info = device_info,
|
||||
.compile_options = ShaderCompileOptions.forDevice(device_info),
|
||||
};
|
||||
}
|
||||
|
||||
/// Clean up resources
|
||||
pub fn deinit(self: *Self) void {
|
||||
_ = self;
|
||||
}
|
||||
|
||||
/// Get optimal threadgroup size for a compute shader on current device
|
||||
pub fn getOptimalThreadgroupSize(self: *Self) struct { x: u32, y: u32, z: u32 } {
|
||||
if (self.device_info == null or !self.device_info.?.is_apple_silicon) {
|
||||
return .{ .x = 8, .y = 8, .z = 1 };
|
||||
}
|
||||
|
||||
// M-series chips have different optimal sizes
|
||||
if (self.device_info.?.is_m_series) {
|
||||
return switch (self.device_info.?.series_generation) {
|
||||
3 => .{ .x = 16, .y = 16, .z = 1 }, // M3 has more GPU cores
|
||||
2 => .{ .x = 16, .y = 8, .z = 1 }, // M2
|
||||
else => .{ .x = 8, .y = 8, .z = 1 }, // M1
|
||||
};
|
||||
}
|
||||
|
||||
return .{ .x = 8, .y = 8, .z = 1 };
|
||||
}
|
||||
|
||||
/// Get memory barrier type based on hardware capabilities
|
||||
pub fn getOptimalBarrierType(self: *Self) []const u8 {
|
||||
// Newer M-series chips support more efficient memory barriers
|
||||
if (self.device_info != null and
|
||||
self.device_info.?.is_m_series and
|
||||
self.device_info.?.series_generation >= 2) {
|
||||
return "MTLBarrierScopeBuffers";
|
||||
}
|
||||
|
||||
return "MTLBarrierScopeTextures | MTLBarrierScopeBuffers";
|
||||
}
|
||||
|
||||
/// Generate compilation options string for Metal API
|
||||
pub fn getCompileOptionsString(self: *Self) []const u8 {
|
||||
_ = self;
|
||||
// In a real implementation, this would return Objective-C code to set up
|
||||
// MTLCompileOptions with the appropriate parameters
|
||||
return "MTLCompileOptions"; // Placeholder
|
||||
}
|
||||
};
|
||||
|
||||
/// Create optimized Metal shaders for key operations based on device capabilities
|
||||
pub fn createOptimizedMetalShaders(device_info: ?MetalDeviceInfo) struct {
|
||||
matmul: []const u8,
|
||||
rms_norm: []const u8,
|
||||
swiglu: []const u8,
|
||||
attention: []const u8,
|
||||
} {
|
||||
// Base versions of shaders
|
||||
const base_matmul_shader =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void matmul_kernel(
|
||||
\\ device const float* a [[buffer(0)]],
|
||||
\\ device const float* b [[buffer(1)]],
|
||||
\\ device float* c [[buffer(2)]],
|
||||
\\ constant uint& M [[buffer(3)]],
|
||||
\\ constant uint& N [[buffer(4)]],
|
||||
\\ constant uint& K [[buffer(5)]],
|
||||
\\ uint2 gid [[thread_position_in_grid]]
|
||||
\\) {
|
||||
\\ if (gid.x >= N || gid.y >= M) return;
|
||||
\\
|
||||
\\ float sum = 0.0;
|
||||
\\ for (uint k = 0; k < K; k++) {
|
||||
\\ sum += a[gid.y * K + k] * b[k * N + gid.x];
|
||||
\\ }
|
||||
\\ c[gid.y * N + gid.x] = sum;
|
||||
\\}
|
||||
;
|
||||
|
||||
const base_rms_norm_shader =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void rms_norm_kernel(
|
||||
\\ device const float* input [[buffer(0)]],
|
||||
\\ device const float* weight [[buffer(1)]],
|
||||
\\ device float* output [[buffer(2)]],
|
||||
\\ constant uint& size [[buffer(3)]],
|
||||
\\ constant float& eps [[buffer(4)]],
|
||||
\\ uint idx [[thread_position_in_grid]]
|
||||
\\) {
|
||||
\\ if (idx >= size) return;
|
||||
\\
|
||||
\\ // Calculate sum of squares
|
||||
\\ float sum_sq = 0.0;
|
||||
\\ for (uint i = 0; i < size; i++) {
|
||||
\\ float val = input[i];
|
||||
\\ sum_sq += val * val;
|
||||
\\ }
|
||||
\\
|
||||
\\ // RMS normalization
|
||||
\\ float rms = sqrt(sum_sq / size + eps);
|
||||
\\ output[idx] = input[idx] / rms * weight[idx];
|
||||
\\}
|
||||
;
|
||||
|
||||
// Default implementations
|
||||
var matmul = base_matmul_shader;
|
||||
var rms_norm = base_rms_norm_shader;
|
||||
var swiglu = ""; // Placeholder
|
||||
var attention = ""; // Placeholder
|
||||
|
||||
// For M-series chips, we can use optimized implementations
|
||||
if (device_info != null and device_info.?.is_m_series) {
|
||||
// M3 optimizations
|
||||
if (device_info.?.series_generation >= 3) {
|
||||
// M3 has improved threadgroup memory, use tiled implementation
|
||||
matmul =
|
||||
\\#include <metal_stdlib>
|
||||
\\using namespace metal;
|
||||
\\
|
||||
\\kernel void matmul_kernel_optimized_m3(
|
||||
\\ device const float* a [[buffer(0)]],
|
||||
\\ device const float* b [[buffer(1)]],
|
||||
\\ device float* c [[buffer(2)]],
|
||||
\\ constant uint& M [[buffer(3)]],
|
||||
\\ constant uint& N [[buffer(4)]],
|
||||
\\ constant uint& K [[buffer(5)]],
|
||||
\\ uint2 gid [[thread_position_in_grid]],
|
||||
\\ uint2 tid [[thread_position_in_threadgroup]],
|
||||
\\ uint2 tgid [[threadgroup_position_in_grid]]
|
||||
\\) {
|
||||
\\ // Advanced implementation with tiling and local memory
|
||||
\\ // Optimized for M3 architecture
|
||||
\\ // ...
|
||||
\\}
|
||||
;
|
||||
|
||||
// Similar optimizations for other kernels...
|
||||
}
|
||||
}
|
||||
|
||||
return .{
|
||||
.matmul = matmul,
|
||||
.rms_norm = rms_norm,
|
||||
.swiglu = swiglu,
|
||||
.attention = attention,
|
||||
};
|
||||
}
|
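A hypothetical driver for the shader utilities above (not part of this commit; the `@import` path assumes a file alongside `shader.zig`):

```zig
const std = @import("std");
const shader = @import("shader.zig");

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // Without device info the manager falls back to conservative defaults.
    var manager = shader.ShaderManager.init(gpa.allocator(), null);
    defer manager.deinit();

    const tg = manager.getOptimalThreadgroupSize(); // 8x8x1 when no device info is available
    std.debug.print("threadgroup: {d}x{d}x{d}\n", .{ tg.x, tg.y, tg.z });

    const shaders = shader.createOptimizedMetalShaders(null);
    std.debug.print("base matmul kernel source: {d} bytes\n", .{shaders.matmul.len});
}
```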
experimental/src/test_m_series.zig (new file)
@ -0,0 +1,39 @@
|
||||
// Test program for M series detection
|
||||
const std = @import("std");
|
||||
const metal_device = @import("backends/metal/device.zig");
|
||||
|
||||
pub fn main() !void {
|
||||
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
||||
defer _ = gpa.deinit();
|
||||
|
||||
const allocator = gpa.allocator();
|
||||
|
||||
std.log.info("Testing M series detection...", .{});
|
||||
|
||||
// Detect Apple Silicon and M-series capabilities
|
||||
const device_info = try metal_device.detectAppleSilicon(allocator);
|
||||
defer {
|
||||
allocator.free(device_info.device_name);
|
||||
allocator.free(device_info.variant);
|
||||
}
|
||||
|
||||
std.log.info("Device Info:", .{});
|
||||
std.log.info(" Device Name: {s}", .{device_info.device_name});
|
||||
std.log.info(" Is Apple Silicon: {}", .{device_info.is_apple_silicon});
|
||||
std.log.info(" Is M Series: {}", .{device_info.is_m_series});
|
||||
|
||||
if (device_info.is_m_series) {
|
||||
std.log.info(" M Series Generation: {}", .{device_info.series_generation});
|
||||
std.log.info(" Variant: {s}", .{device_info.variant});
|
||||
}
|
||||
|
||||
std.log.info(" Unified Memory: {} GB", .{device_info.unified_memory_size / (1024 * 1024 * 1024)});
|
||||
std.log.info(" Has Apple Neural Engine: {}", .{device_info.has_anc});
|
||||
|
||||
// Test other utility functions
|
||||
std.log.info("Optimal Work Group Size: {}", .{metal_device.getOptimalWorkGroupSize()});
|
||||
std.log.info("Memory Strategy: {s}", .{@tagName(metal_device.getMemoryStrategy())});
|
||||
std.log.info("Optimal Tensor Block Size: {}", .{metal_device.getOptimalTensorBlockSize()});
|
||||
|
||||
std.log.info("Test complete!", .{});
|
||||
}
|
Binary file not shown.