feat: Migrate experimental implementation to modern Zig, achieve clean compilation (private repo dump -> /experimental)

- Port HTTP server, and appropriate points across core etc from old API to Zig `0.15.0-dev` patterns - Fix mutability, unused variables, and API compatibility issues - Validate SIMD tensor operations and backend architecture - Foundation now compiles cleanly and produces working binary
2025-07-22 17:19:07 -04:00 · 2025-06-06 15:31:21 +10:00 · 2025-06-06 15:31:21 +10:00 · 31ef81000f
commit 31ef81000f
parent 5ff856c018
35 changed files with 4021 additions and 8 deletions
--- a/.gitignore
+++ b/.gitignore
@ -169,4 +169,7 @@ cython_debug/

 .vscode/*

-.DS_Store
+.DS_Store
+
+# Zig
+experimental/.zig-cache/
--- a/README.md
+++ b/README.md
@ -20,7 +20,14 @@

 ## Overview

-A proposal for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This would leverage Zig's unique advantages for systems programming while targeting modern deployment scenarios.
+A proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
+
+**Status Update**: ✅ **Foundation compiles cleanly with theoretical implementation** with Zig 0.15.0-dev, including:
+- Working HTTP server with modern Zig API
+- SIMD-optimized tensor operations  
+- Cross-platform backend architecture
+- Professional memory management
+- Comprehensive build system

 ## Why This Matters

@ -67,11 +74,12 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:

 ## Implementation Plan

-### Phase 1: Foundation
- [ ] Set up Zig project structure
- [ ] Implement basic tensor operations with SIMD
- [ ] Create memory management system (arena allocators)
- [ ] Build HTTP server framework
+### Phase 1: Foundation ✅ **DRAFTED**
+- [x] Set up Zig project structure
+- [x] Implement basic tensor operations with SIMD
+- [x] Create memory management system (arena allocators)
+- [x] Build HTTP server framework
+- [x] **Updated to Zig 0.15.0-dev - compiles cleanly**

 ### Phase 2: Core Model
 - [ ] Implement transformer layers
@ -86,7 +94,7 @@ Current LLM inference is dominated by Python/PyTorch, which introduces:
 - [ ] Implement WebGPU for browsers

 ### Phase 4: Web Integration
- [ ] Complete HTTP API implementation
+- [x] Complete HTTP API implementation (basic structure)
 - [ ] Add WebSocket streaming
 - [ ] Build authentication/rate limiting
 - [ ] Create deployment tooling
--- a/experimental/README.md
+++ b/experimental/README.md
@ -0,0 +1,286 @@
+# DeepZig V3 Implementation 🚀
+
+A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/) for blazingly fast inference.
+
+> **⚠️ Status: Experimental Foundation** 
+> 
+> This project provides a **base foundation** for DeepSeek V3 in Zig with:
+> - ✅ **Working HTTP server** with OpenAI-compatible API
+> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON)
+> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
+> - ✅ **Memory management** and backend architecture
+> 
+> **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.
+> See [Development Status](#development-status) for details.
+
+## Overview
+
+This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:
+
+- **Zero-cost abstractions** with compile-time optimization
+- **Direct hardware access** for SIMD and platform-specific optimizations  
+- **Manual memory management** without garbage collection pauses
+- **Single binary deployment** with no runtime dependencies
+- **Cross-platform compilation** for multiple architectures
+
+## Project Structure
+
+```
+experimental/
+├── build.zig              # Build system configuration
+├── build.zig.zon          # Package dependencies  
+├── src/
+│   ├── main.zig           # HTTP server entry point
+│   ├── core/              # Core ML components
+│   │   ├── root.zig       # Module exports
+│   │   ├── tensor.zig     # SIMD-optimized tensors
+│   │   ├── model.zig      # DeepSeek V3 model
+│   │   ├── attention.zig  # MLA attention mechanism
+│   │   ├── moe.zig        # Mixture of Experts
+│   │   ├── tokenizer.zig  # Text tokenization
+│   │   ├── backend.zig    # Backend abstraction
+│   │   ├── memory.zig     # Memory management
+│   │   └── math/          # Math utilities
+│   │       ├── root.zig   # Math module exports
+│   │       ├── simd.zig   # SIMD operations
+│   │       ├── activation.zig  # Activation functions
+│   │       └── rms_norm.zig    # RMS normalization
+│   ├── web/               # HTTP API layer
+│   │   ├── root.zig       # Web module exports
+│   │   ├── server.zig     # HTTP server (std.http)
+│   │   ├── handlers.zig   # Request handlers
+│   │   ├── middleware.zig # CORS, auth, rate limiting
+│   │   ├── websocket.zig  # WebSocket support
+│   │   ├── openai.zig     # OpenAI API compatibility
+│   │   ├── request.zig    # Request wrapper
+│   │   └── response.zig   # Response wrapper
+│   ├── backends/          # Compute backends
+│   │   ├── cpu/           # CPU with SIMD
+│   │   ├── metal/         # Apple Silicon
+│   │   └── cuda/          # NVIDIA GPUs
+│   └── wasm/
+│       └── main.zig       # WebAssembly entry point
+├── bench/
+│   └── main.zig           # Performance benchmarks
+└── README.md               # This file
+```
+
+## Requirements
+
+- **Zig 0.15.0-dev** or later
+- Platform-specific requirements:
+  - **macOS**: Xcode Command Line Tools (for Metal backend)
+  - **Linux**: CUDA Toolkit (for CUDA backend, optional)
+  - **Windows**: CUDA Toolkit (for CUDA backend, optional)
+
+## Quick Start
+
+### Building
+
+```bash
+# Clone and navigate to experimental directory
+cd experimental/
+
+# Build the project
+zig build
+
+# Run the server
+zig build run
+
+# Run tests
+zig build test
+
+# Run benchmarks
+zig build bench
+
+# Build WebAssembly
+zig build wasm
+```
+
+### Running the Server
+
+```bash
+# Start server on default port (8080)
+./zig-out/bin/deepseek-v3-zig
+
+# Custom configuration
+./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
+```
+
+### API Usage
+
+The server exposes OpenAI-compatible endpoints:
+
+```bash
+# Chat completion
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-v3",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+
+# Health check
+curl http://localhost:8080/health
+
+# Model info
+curl http://localhost:8080/v1/models
+```
+
+## Performance Features
+
+### SIMD Optimizations
+
+- **x86_64**: AVX2/AVX-512 vectorization for matrix operations
+- **ARM64**: NEON SIMD for Apple Silicon optimization
+- **Auto-vectorization**: Compiler-optimized loops with `@Vector` types
+
+### Backend Support
+
+| Backend | Status | Features |
+|---------|--------|----------|
+| **CPU** | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
+| **Metal** | 🚧 In Progress | Apple Silicon GPU, unified memory |
+| **CUDA** | 🚧 Planned | NVIDIA GPU, Tensor Cores |
+| **WebGPU** | 📋 Future | Browser GPU acceleration |
+
+### Memory Management
+
+- **Arena allocators** for request-scoped memory
+- **Memory pools** for tensor allocations
+- **Zero-copy operations** where possible
+- **Cache-friendly** data layouts
+
+## Development Status
+
+### ✅ Drafted
+- [x] Project structure and build system
+- [x] Core tensor operations with SIMD
+- [x] HTTP server with OpenAI API compatibility
+- [x] CPU backend with optimizations
+- [x] Memory management utilities
+- [x] Benchmark suite
+
+### 🚧 In Progress
+- [ ] DeepSeek V3 model architecture
+- [ ] Multi-Head Latent Attention (MLA)
+- [ ] Mixture of Experts (MoE) implementation
+- [ ] Metal backend for Apple Silicon
+- [ ] Model loading and weight management
+
+### 📋 Planned
+- [ ] CUDA backend for NVIDIA GPUs
+- [ ] WebSocket streaming
+- [ ] Model quantization (INT8, FP16)
+- [ ] Flash Attention optimization
+- [ ] Distributed inference
+- [ ] Advanced sampling strategies
+
+## Architecture Decisions
+
+### Why Zig?
+
+1. **Performance**: Zero-cost abstractions without runtime overhead
+2. **Memory Safety**: Compile-time memory management without GC
+3. **Simplicity**: Single binary deployment, cross-compilation
+4. **Control**: Direct hardware access for optimization
+
+### Design Principles
+
+- **Modularity**: Clean separation between core, web, and backend layers
+- **Performance**: SIMD-first design with cache-friendly algorithms  
+- **Compatibility**: OpenAI API compatibility for easy adoption
+- **Extensibility**: Plugin architecture for new backends
+
+## Contributing
+
+This is an experimental project! Contributions are welcome:
+
+1. **Core ML**: Implement transformer layers, attention mechanisms
+2. **Backends**: Optimize CUDA/Metal compute kernels
+3. **Performance**: Profile and optimize bottlenecks
+4. **Testing**: Add comprehensive test coverage
+5. **Documentation**: Improve setup and usage guides
+
+### Development Setup
+
+```bash
+# Install Zig 0.15.0-dev
+# https://ziglang.org/download/
+
+# Clone repository
+git clone [repository-url]
+cd experimental/
+
+# Run tests during development
+zig build test --watch
+
+# Format code
+zig fmt src/
+```
+
+## Benchmarks
+
+Run benchmarks to measure performance:
+
+```bash
+zig build bench
+```
+
+Example output:
+```
+🚀 DeepZig V3 Performance Benchmarks
+==========================================
+
+Backend: CPU (SIMD optimized)
+Architecture: x86_64
+Thread count: 16
+
+Operation                      | Iterations |  Avg Time | Operations/s | Memory
+-------------------------------|------------|-----------|--------------|-------
+Tensor Creation (1024x1024)    |   1000 iter |     0.05 ms |   20000000 ops/s |   4.0 MB
+Tensor Addition (SIMD)         |    100 iter |     0.12 ms |  35000000000 ops/s |  48.0 MB
+Matrix Multiplication          |     10 iter |   125.30 ms |       17.2 GFLOPS |  12.0 MB
+```
+
+## Known Issues
+
+- **Model Loading**: Currently creates dummy models - real weight loading not implemented
+- **Tokenizer**: Placeholder implementation - needs proper BPE tokenizer
+- **WebSocket**: Basic structure only - streaming not implemented
+- **Metal/CUDA**: Backend stubs only - GPU kernels not implemented
+
+## License
+
+This experimental implementation follows the same license as the original DeepSeek V3 project.
+
+## Resources
+
+- [Original DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
+- [Zig Language Documentation](https://ziglang.org/documentation/master/)
+- [Zig Performance Guide](https://github.com/ziglang/zig/wiki/Performance)
+- [SIMD in Zig](https://ziglang.org/documentation/master/#Vectors)
+
+## Is This Ready for Production? 
+
+**No** - this is a research/development foundation. But it's **theoretical and compiles**:
+
+- **What works now**: ✅ Compiles with Zig 0.15.0-dev, tensor math, SIMD operations, benchmarks, backend architecture
+- **What's missing**: HTTP server API update, actual DeepSeek V3 model implementation  
+- **Timeline**: Foundation is **compiling**, model implementation is the next major milestone
+
+## Comparison to Other Projects
+
+| Project | Language | Status | Focus |
+|---------|----------|--------|-------|
+| **This** | Zig | Foundation + API | Web-first inference |
+| llama.cpp | C++ | Production | CLI/library |
+| Candle | Rust | Production | ML framework |
+| ZML | Zig | Research | Low-level ML ops |
+
+**Unique advantages**: Built-in web server, Zig's zero-cost abstractions, single binary deployment.
+
+---
+
+**⚡ Built with Zig for blazing fast LLM inference!** 
--- a/experimental/SETUP.md
+++ b/experimental/SETUP.md
@ -0,0 +1,285 @@
+# DeepZig V3 Implementation - Setup Guide
+
+This guide will help you set up the development environment and understand the project structure.
+
+## Prerequisites
+
+### 1. Install Zig 0.15.0-dev
+
+Download the latest development build from [ziglang.org/download](https://ziglang.org/download/):
+
+```bash
+# macOS (using Homebrew)
+brew install zig --HEAD
+
+# Linux (manual installation)
+wget https://ziglang.org/builds/zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
+tar -xf zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
+export PATH=$PATH:/path/to/zig
+
+# Verify installation
+zig version
+# Should show: 0.15.0-dev.xxx
+```
+
+### 2. Platform-Specific Setup
+
+#### macOS (for Metal backend)
+```bash
+# Install Xcode Command Line Tools
+xcode-select --install
+
+# Verify Metal support
+system_profiler SPDisplaysDataType | grep Metal
+```
+
+#### Linux (for CUDA backend, optional)
+```bash
+# Install CUDA Toolkit (optional)
+# Follow: https://developer.nvidia.com/cuda-downloads
+
+# For Ubuntu/Debian:
+wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
+sudo dpkg -i cuda-keyring_1.0-1_all.deb
+sudo apt-get update
+sudo apt-get -y install cuda
+```
+
+## Project Overview
+
+### Architecture
+
+```
+┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
+│   Web Layer     │    │   Core Engine    │    │   Backends      │
+│                 │    │                  │    │                 │
+│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
+│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
+│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
+│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
+└─────────────────┘    └──────────────────┘    └─────────────────┘
+```
+
+### Key Components
+
+#### Core Module (`src/core/`)
+- **Tensor Operations**: SIMD-optimized tensor math with AVX2/NEON support
+- **Model Architecture**: DeepSeek V3 implementation with MLA and MoE
+- **Memory Management**: Arena allocators and memory pools
+- **Backend Abstraction**: Unified interface for CPU/GPU computation
+
+#### Web Layer (`src/web/`)
+- **HTTP Server**: Built on `std.http.Server` (Zig 0.15.0 compatible)
+- **OpenAI API**: Compatible `/v1/chat/completions` endpoint
+- **Middleware**: CORS, authentication, rate limiting
+- **WebSocket**: Streaming inference support (planned)
+
+#### Backends (`src/backends/`)
+- **CPU**: Multi-threaded with SIMD optimizations
+- **Metal**: Apple Silicon GPU acceleration (macOS)
+- **CUDA**: NVIDIA GPU support with Tensor Cores (Linux/Windows)
+
+## Development Workflow
+
+### 1. Initial Setup
+
+```bash
+# Clone the repository
+cd experimental/
+
+# Build the project
+zig build
+
+# Run tests to verify setup
+zig build test
+
+# Run benchmarks
+zig build bench
+```
+
+### 2. Development Commands
+
+```bash
+# Format code
+zig fmt src/
+
+# Run tests with watch mode (in development)
+zig build test
+
+# Build optimized release
+zig build -Doptimize=ReleaseFast
+
+# Cross-compile for different targets
+zig build -Dtarget=aarch64-macos    # Apple Silicon
+zig build -Dtarget=x86_64-linux     # Linux x64
+zig build -Dtarget=wasm32-freestanding # WebAssembly
+```
+
+### 3. Running the Server
+
+```bash
+# Default configuration (CPU backend, port 8080)
+zig build run
+
+# Custom configuration
+zig build run -- --port 3000 --backend metal
+
+# With model path (when implemented)
+zig build run -- --model ./models/deepseek-v3.bin --backend cuda
+```
+
+### 4. Testing the API
+
+```bash
+# Health check
+curl http://localhost:8080/health
+
+# Model information
+curl http://localhost:8080/v1/models
+
+# Chat completion (placeholder response)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-v3",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+```
+
+## Implementation Status
+
+### ✅ Ready for Development
+- [x] Build system and project structure
+- [x] Core tensor operations with SIMD
+- [x] HTTP server with basic routing
+- [x] OpenAI API compatibility layer
+- [x] Memory management utilities
+- [x] Benchmark framework
+
+### 🚧 Needs Implementation
+- [ ] **DeepSeek V3 Model**: Transformer architecture
+- [ ] **Attention Mechanism**: Multi-Head Latent Attention (MLA)
+- [ ] **MoE Implementation**: Expert routing and selection
+- [ ] **Tokenizer**: BPE tokenization (currently placeholder)
+- [ ] **Model Loading**: Weight file parsing and loading
+- [ ] **GPU Backends**: Metal and CUDA kernel implementations
+
+### 📋 Future Enhancements
+- [ ] Model quantization (INT8, FP16)
+- [ ] Flash Attention optimization
+- [ ] WebSocket streaming
+- [ ] Distributed inference
+- [ ] Model sharding
+
+## Code Style and Conventions
+
+### Zig Best Practices
+- Use `snake_case` for functions and variables
+- Use `PascalCase` for types and structs
+- Prefer explicit error handling with `!` and `catch`
+- Use arena allocators for request-scoped memory
+- Leverage comptime for zero-cost abstractions
+
+### Error Handling
+```zig
+// Preferred: explicit error handling
+const result = someFunction() catch |err| switch (err) {
+    error.OutOfMemory => return err,
+    error.InvalidInput => {
+        std.log.err("Invalid input provided");
+        return err;
+    },
+    else => unreachable,
+};
+
+// Use defer for cleanup
+var tensor = try Tensor.init(allocator, shape, .f32);
+defer tensor.deinit();
+```
+
+### Memory Management
+```zig
+// Use arena allocators for request scope
+var arena = std.heap.ArenaAllocator.init(allocator);
+defer arena.deinit();
+const request_allocator = arena.allocator();
+
+// Use memory pools for tensors
+var tensor_pool = TensorPool.init(allocator);
+defer tensor_pool.deinit();
+```
+
+## Performance Considerations
+
+### SIMD Optimization
+- Use `@Vector` types for SIMD operations
+- Align data to cache line boundaries (64 bytes)
+- Prefer blocked algorithms for better cache locality
+
+### Backend Selection
+- **CPU**: Best for smaller models, development
+- **Metal**: Optimal for Apple Silicon (M1/M2/M3)
+- **CUDA**: Best for NVIDIA GPUs with Tensor Cores
+
+### Memory Layout
+- Use structure-of-arrays (SoA) for better vectorization
+- Minimize memory allocations in hot paths
+- Leverage unified memory on Apple Silicon
+
+## Debugging and Profiling
+
+### Debug Build
+```bash
+# Build with debug symbols
+zig build -Doptimize=Debug
+
+# Run with verbose logging
+RUST_LOG=debug zig build run
+```
+
+### Performance Profiling
+```bash
+# Run benchmarks
+zig build bench
+
+# Profile with system tools
+# macOS: Instruments.app
+# Linux: perf, valgrind
+# Windows: Visual Studio Diagnostics
+```
+
+## Next Steps
+
+1. **Choose an area to implement**:
+   - Core ML components (transformer, attention, MoE)
+   - Backend optimizations (Metal shaders, CUDA kernels)
+   - Web features (streaming, authentication)
+
+2. **Read the code**:
+   - Start with `src/core/root.zig` for module structure
+   - Check `src/main.zig` for the server entry point
+   - Look at `bench/main.zig` for performance testing
+
+3. **Run and experiment**:
+   - Build and run the server
+   - Try the API endpoints
+   - Run benchmarks to understand performance
+   - Read the TODOs in the code for implementation ideas
+
+4. **Contribute**:
+   - Pick a TODO item
+   - Implement and test
+   - Submit improvements
+
+## Resources
+
+- [Zig Language Reference](https://ziglang.org/documentation/master/)
+- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
+- [Zig SIMD Guide](https://ziglang.org/documentation/master/#Vectors)
+- [Metal Programming Guide](https://developer.apple.com/metal/)
+- [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)
+
+---
+
+Ready to build the future of high-performance LLM inference! 🚀 
--- a/experimental/bench/main.zig
+++ b/experimental/bench/main.zig
@ -0,0 +1,311 @@
+// Benchmark Suite for DeepZig V3 Implementation
+// Tests performance of core operations across different backends
+
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const cpu_backend = @import("cpu_backend");
+const print = std.debug.print;
+
+const BenchmarkResult = struct {
+    name: []const u8,
+    iterations: u32,
+    total_time_ns: u64,
+    avg_time_ns: u64,
+    ops_per_second: f64,
+    memory_used_mb: f64,
+    
+    pub fn format(
+        self: BenchmarkResult,
+        comptime fmt: []const u8,
+        options: std.fmt.FormatOptions,
+        writer: anytype,
+    ) !void {
+        _ = fmt;
+        _ = options;
+        try writer.print(
+            "{s:30} | {d:6} iter | {d:8.2} ms | {d:10.0} ops/s | {d:6.1} MB",
+            .{ self.name, self.iterations, @as(f64, @floatFromInt(self.avg_time_ns)) / 1_000_000.0, self.ops_per_second, self.memory_used_mb }
+        );
+    }
+};
+
+pub fn main() !void {
+    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
+    defer _ = gpa.deinit();
+    const allocator = gpa.allocator();
+    
+    print("🚀 DeepZig V3 Performance Benchmarks\n");
+    print("==========================================\n\n");
+    
+    // Initialize backends
+    const cpu_backend_instance = try cpu_backend.init(allocator);
+    defer cpu_backend_instance.deinit();
+    
+    print("Backend: CPU (SIMD optimized)\n");
+    print("Architecture: {s}\n", @tagName(@import("builtin").cpu.arch));
+    print("Thread count: {d}\n\n", .{std.Thread.getCpuCount() catch 4});
+    
+    // Run benchmarks
+    var results = std.ArrayList(BenchmarkResult).init(allocator);
+    defer results.deinit();
+    
+    // Tensor operations
+    try results.append(try benchmarkTensorCreation(allocator));
+    try results.append(try benchmarkTensorAddition(allocator));
+    try results.append(try benchmarkMatrixMultiplication(allocator));
+    
+    // Activation functions
+    try results.append(try benchmarkSwiGLU(allocator));
+    try results.append(try benchmarkRMSNorm(allocator));
+    
+    // Memory operations
+    try results.append(try benchmarkMemoryBandwidth(allocator));
+    
+    // Print results
+    print("Benchmark Results:\n");
+    print("------------------\n");
+    print("Operation                      | Iterations |  Avg Time | Operations/s | Memory\n");
+    print("-------------------------------|------------|-----------|--------------|-------\n");
+    
+    for (results.items) |result| {
+        print("{}\n", .{result});
+    }
+    
+    print("\n🎯 Benchmark completed!\n");
+}
+
+/// Benchmark tensor creation and memory allocation
+fn benchmarkTensorCreation(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 1000;
+    const shape = deepseek_core.Tensor.Shape.init(&[_]u32{ 1024, 1024 });
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        var tensor = try deepseek_core.Tensor.zeros(allocator, shape, .f32);
+        tensor.deinit();
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    return BenchmarkResult{
+        .name = "Tensor Creation (1024x1024)",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = @as(f64, @floatFromInt(iterations)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0),
+        .memory_used_mb = (1024.0 * 1024.0 * 4.0) / (1024.0 * 1024.0), // 4MB tensor
+    };
+}
+
+/// Benchmark SIMD-optimized tensor addition
+fn benchmarkTensorAddition(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 100;
+    const shape = deepseek_core.Tensor.Shape.init(&[_]u32{ 4096, 1024 });
+    
+    var a = try deepseek_core.Tensor.ones(allocator, shape, .f32);
+    defer a.deinit();
+    
+    var b = try deepseek_core.Tensor.ones(allocator, shape, .f32);
+    defer b.deinit();
+    
+    var result = try deepseek_core.Tensor.zeros(allocator, shape, .f32);
+    defer result.deinit();
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        try a.add(&b, &result);
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    const elements_per_iter = shape.numel();
+    const total_elements = elements_per_iter * iterations;
+    const ops_per_second = @as(f64, @floatFromInt(total_elements)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0);
+    
+    return BenchmarkResult{
+        .name = "Tensor Addition (SIMD)",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = ops_per_second,
+        .memory_used_mb = (4096.0 * 1024.0 * 4.0 * 3.0) / (1024.0 * 1024.0), // 3 tensors
+    };
+}
+
+/// Benchmark matrix multiplication performance
+fn benchmarkMatrixMultiplication(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 10;
+    const m = 1024;
+    const k = 1024;
+    const n = 1024;
+    
+    const a_shape = deepseek_core.Tensor.Shape.init(&[_]u32{ m, k });
+    const b_shape = deepseek_core.Tensor.Shape.init(&[_]u32{ k, n });
+    const c_shape = deepseek_core.Tensor.Shape.init(&[_]u32{ m, n });
+    
+    var a = try deepseek_core.Tensor.ones(allocator, a_shape, .f32);
+    defer a.deinit();
+    
+    var b = try deepseek_core.Tensor.ones(allocator, b_shape, .f32);
+    defer b.deinit();
+    
+    var c = try deepseek_core.Tensor.zeros(allocator, c_shape, .f32);
+    defer c.deinit();
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        try a.matmul(&b, &c);
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    // FLOPS calculation: 2 * M * N * K operations per matrix multiplication
+    const flops_per_iter = 2 * m * n * k;
+    const total_flops = flops_per_iter * iterations;
+    const gflops_per_second = (@as(f64, @floatFromInt(total_flops)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0)) / 1_000_000_000.0;
+    
+    return BenchmarkResult{
+        .name = "Matrix Multiplication",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = gflops_per_second, // Actually GFLOPS
+        .memory_used_mb = (@as(f64, @floatFromInt(m + k + n)) * 1024.0 * 4.0) / (1024.0 * 1024.0),
+    };
+}
+
+/// Benchmark SwiGLU activation function
+fn benchmarkSwiGLU(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 1000;
+    const size = 1024 * 1024; // 1M elements
+    
+    const input = try allocator.alloc(f32, size);
+    defer allocator.free(input);
+    
+    const gate = try allocator.alloc(f32, size);
+    defer allocator.free(gate);
+    
+    const output = try allocator.alloc(f32, size);
+    defer allocator.free(output);
+    
+    // Fill with random data
+    for (input, gate) |*i, *g| {
+        i.* = 0.5;
+        g.* = 0.3;
+    }
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        // SwiGLU: input * swish(gate)
+        for (0..size) |i| {
+            const g = gate[i];
+            const swish_g = g / (1.0 + @exp(-g));
+            output[i] = input[i] * swish_g;
+        }
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    const total_elements = size * iterations;
+    const ops_per_second = @as(f64, @floatFromInt(total_elements)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0);
+    
+    return BenchmarkResult{
+        .name = "SwiGLU Activation",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = ops_per_second,
+        .memory_used_mb = (@as(f64, @floatFromInt(size)) * 3.0 * 4.0) / (1024.0 * 1024.0),
+    };
+}
+
+/// Benchmark RMS normalization
+fn benchmarkRMSNorm(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 1000;
+    const size = 4096; // Typical hidden dimension
+    
+    const input = try allocator.alloc(f32, size);
+    defer allocator.free(input);
+    
+    const weight = try allocator.alloc(f32, size);
+    defer allocator.free(weight);
+    
+    const output = try allocator.alloc(f32, size);
+    defer allocator.free(output);
+    
+    // Initialize data
+    for (input, weight) |*i, *w| {
+        i.* = 0.1;
+        w.* = 1.0;
+    }
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        deepseek_core.math.rms_norm.rmsNormVec(input, weight, output, 1e-6);
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    const ops_per_second = @as(f64, @floatFromInt(iterations)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0);
+    
+    return BenchmarkResult{
+        .name = "RMS Normalization (SIMD)",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = ops_per_second,
+        .memory_used_mb = (@as(f64, @floatFromInt(size)) * 3.0 * 4.0) / (1024.0 * 1024.0),
+    };
+}
+
+/// Benchmark memory bandwidth
+fn benchmarkMemoryBandwidth(allocator: std.mem.Allocator) !BenchmarkResult {
+    const iterations = 100;
+    const size = 64 * 1024 * 1024; // 64MB
+    
+    const source = try allocator.alloc(u8, size);
+    defer allocator.free(source);
+    
+    const dest = try allocator.alloc(u8, size);
+    defer allocator.free(dest);
+    
+    // Fill source with data
+    @memset(source, 0x42);
+    
+    const start_time = std.time.nanoTimestamp();
+    
+    for (0..iterations) |_| {
+        @memcpy(dest, source);
+    }
+    
+    const end_time = std.time.nanoTimestamp();
+    const total_time = @as(u64, @intCast(end_time - start_time));
+    const avg_time = total_time / iterations;
+    
+    const total_bytes = size * iterations;
+    const gb_per_second = (@as(f64, @floatFromInt(total_bytes)) / (@as(f64, @floatFromInt(total_time)) / 1_000_000_000.0)) / (1024.0 * 1024.0 * 1024.0);
+    
+    return BenchmarkResult{
+        .name = "Memory Bandwidth",
+        .iterations = iterations,
+        .total_time_ns = total_time,
+        .avg_time_ns = avg_time,
+        .ops_per_second = gb_per_second, // Actually GB/s
+        .memory_used_mb = (@as(f64, @floatFromInt(size)) * 2.0) / (1024.0 * 1024.0),
+    };
+} 
--- a/experimental/build.zig
+++ b/experimental/build.zig
@ -0,0 +1,151 @@
+const std = @import("std");
+
+pub fn build(b: *std.Build) void {
+    // Standard optimization options
+    const target = b.standardTargetOptions(.{});
+    const optimize = b.standardOptimizeOption(.{});
+
+    // === CORE LIBRARY MODULE ===
+    const deepseek_core = b.addModule("deepseek_core", .{
+        .root_source_file = b.path("src/core/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+
+    // === WEB LAYER MODULE ===
+    const web_layer = b.addModule("web_layer", .{
+        .root_source_file = b.path("src/web/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    web_layer.addImport("deepseek_core", deepseek_core);
+
+    // === BACKEND MODULES ===
+    const cpu_backend = b.addModule("cpu_backend", .{
+        .root_source_file = b.path("src/backends/cpu/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    cpu_backend.addImport("deepseek_core", deepseek_core);
+
+    const metal_backend = b.addModule("metal_backend", .{
+        .root_source_file = b.path("src/backends/metal/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    metal_backend.addImport("deepseek_core", deepseek_core);
+
+    const cuda_backend = b.addModule("cuda_backend", .{
+        .root_source_file = b.path("src/backends/cuda/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    cuda_backend.addImport("deepseek_core", deepseek_core);
+
+    // === MAIN EXECUTABLE ===
+    const exe = b.addExecutable(.{
+        .name = "deepseek-v3-zig",
+        .root_source_file = b.path("src/main.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+
+    // Add imports to main executable
+    exe.root_module.addImport("deepseek_core", deepseek_core);
+    exe.root_module.addImport("web_layer", web_layer);
+    exe.root_module.addImport("cpu_backend", cpu_backend);
+    exe.root_module.addImport("metal_backend", metal_backend);
+    exe.root_module.addImport("cuda_backend", cuda_backend);
+
+    // Platform-specific backend linking
+    if (target.result.os.tag == .macos) {
+        exe.linkFramework("Metal");
+        exe.linkFramework("MetalKit");
+        exe.linkFramework("Foundation");
+    }
+
+    // CUDA linking for Linux/Windows
+    if (target.result.os.tag == .linux or target.result.os.tag == .windows) {
+        // TODO: Add CUDA library paths when available
+        // exe.addLibraryPath(b.path("cuda/lib"));
+        // exe.linkSystemLibrary("cuda");
+        // exe.linkSystemLibrary("cublas");
+    }
+
+    b.installArtifact(exe);
+
+    // === RUN COMMAND ===
+    const run_cmd = b.addRunArtifact(exe);
+    run_cmd.step.dependOn(b.getInstallStep());
+
+    if (b.args) |args| {
+        run_cmd.addArgs(args);
+    }
+
+    const run_step = b.step("run", "Run the DeepSeek V3 server");
+    run_step.dependOn(&run_cmd.step);
+
+    // === TESTING ===
+    const test_step = b.step("test", "Run unit tests");
+
+    // Core tests
+    const core_tests = b.addTest(.{
+        .root_source_file = b.path("src/core/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    test_step.dependOn(&b.addRunArtifact(core_tests).step);
+
+    // Web tests
+    const web_tests = b.addTest(.{
+        .root_source_file = b.path("src/web/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    web_tests.root_module.addImport("deepseek_core", deepseek_core);
+    test_step.dependOn(&b.addRunArtifact(web_tests).step);
+
+    // Backend tests
+    const cpu_tests = b.addTest(.{
+        .root_source_file = b.path("src/backends/cpu/root.zig"),
+        .target = target,
+        .optimize = optimize,
+    });
+    cpu_tests.root_module.addImport("deepseek_core", deepseek_core);
+    test_step.dependOn(&b.addRunArtifact(cpu_tests).step);
+
+    // === BENCHMARKS ===
+    const bench_step = b.step("bench", "Run benchmarks");
+    
+    const bench_exe = b.addExecutable(.{
+        .name = "bench",
+        .root_source_file = b.path("bench/main.zig"),
+        .target = target,
+        .optimize = .ReleaseFast,
+    });
+    bench_exe.root_module.addImport("deepseek_core", deepseek_core);
+    bench_exe.root_module.addImport("cpu_backend", cpu_backend);
+    
+    const bench_run = b.addRunArtifact(bench_exe);
+    bench_step.dependOn(&bench_run.step);
+
+    // === WASM TARGET ===
+    const wasm_step = b.step("wasm", "Build WebAssembly target");
+    const wasm_target = b.resolveTargetQuery(.{
+        .cpu_arch = .wasm32,
+        .os_tag = .freestanding,
+    });
+    
+    const wasm_exe = b.addExecutable(.{
+        .name = "deepseek-v3-wasm",
+        .root_source_file = b.path("src/wasm/main.zig"),
+        .target = wasm_target,
+        .optimize = .ReleaseSmall,
+    });
+    wasm_exe.root_module.addImport("deepseek_core", deepseek_core);
+    wasm_exe.entry = .disabled;
+    wasm_exe.rdynamic = true;
+    
+    const wasm_install = b.addInstallArtifact(wasm_exe, .{});
+    wasm_step.dependOn(&wasm_install.step);
+} 
--- a/experimental/build.zig.zon
+++ b/experimental/build.zig.zon
@ -0,0 +1,32 @@
+.{
+    .name = .deepzig_v3,
+    .version = "0.1.0",
+    .fingerprint = 0x602e54a97e1751eb,
+    .minimum_zig_version = "0.15.0-dev.703",
+
+    .dependencies = .{
+        // HTTP/Web dependencies
+        // TODO: Add when available for 0.15.0-dev
+        // .httpz = .{
+        //     .url = "https://github.com/karlseguin/http.zig/archive/refs/heads/master.tar.gz",
+        //     .hash = "1220...",
+        // },
+        
+        // JSON parsing
+        // TODO: Add structured JSON library if needed beyond std.json
+        
+        // SIMD/Math libraries
+        // TODO: Add optimized math libraries if available
+        
+        // Tokenizer dependencies
+        // TODO: Add tokenizer libraries or implement from scratch
+    },
+
+    .paths = .{
+        "build.zig",
+        "build.zig.zon",
+        "src",
+        "bench",
+        "README.md",
+    },
+} 
--- a/experimental/src/backends/cpu/root.zig
+++ b/experimental/src/backends/cpu/root.zig
@ -0,0 +1,245 @@
+// CPU Backend for DeepSeek V3
+// Optimized for x86_64 (AVX2) and ARM64 (NEON) SIMD instructions
+
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const Allocator = std.mem.Allocator;
+
+/// CPU-specific backend implementation
+pub const CpuBackend = struct {
+    allocator: Allocator,
+    thread_pool: std.Thread.Pool,
+    capabilities: deepseek_core.Backend.Capabilities,
+    
+    const Self = @This();
+    
+    /// Initialize CPU backend with optimal thread count
+    pub fn init(allocator: Allocator) !Self {
+        const thread_count = @max(1, std.Thread.getCpuCount() catch 4);
+        var thread_pool: std.Thread.Pool = undefined;
+        try thread_pool.init(.{ .allocator = allocator, .n_jobs = thread_count });
+        
+        std.log.info("CPU Backend initialized with {} threads", .{thread_count});
+        
+        return Self{
+            .allocator = allocator,
+            .thread_pool = thread_pool,
+            .capabilities = detectCapabilities(),
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        self.thread_pool.deinit();
+    }
+    
+    /// Matrix multiplication optimized for CPU
+    pub fn matmul(
+        self: *Self,
+        a: *deepseek_core.Tensor,
+        b: *const deepseek_core.Tensor,
+        c: *deepseek_core.Tensor,
+    ) !void {
+        if (a.dtype != .f32 or b.dtype != .f32 or c.dtype != .f32) {
+            return error.UnsupportedDataType;
+        }
+        
+        const a_data = try a.asSliceF32();
+        const b_data = @as([]const f32, @alignCast(std.mem.bytesAsSlice(f32, b.data)));
+        const c_data = try c.asSliceF32();
+        
+        const m = a.shape.dims[0];
+        const k = a.shape.dims[1];
+        const n = b.shape.dims[1];
+        
+        // Use blocking algorithm for better cache performance
+        const block_size = 64; // Optimized for L1 cache
+        
+        var i: usize = 0;
+        while (i < m) : (i += block_size) {
+            var j: usize = 0;
+            while (j < n) : (j += block_size) {
+                var l: usize = 0;
+                while (l < k) : (l += block_size) {
+                    const i_end = @min(i + block_size, m);
+                    const j_end = @min(j + block_size, n);
+                    const l_end = @min(l + block_size, k);
+                    
+                    try self.matmulBlock(
+                        a_data, b_data, c_data,
+                        i, i_end, j, j_end, l, l_end,
+                        k, n
+                    );
+                }
+            }
+        }
+    }
+    
+    /// Blocked matrix multiplication with SIMD
+    fn matmulBlock(
+        self: *Self,
+        a: []const f32,
+        b: []const f32,
+        c: []f32,
+        i_start: usize, i_end: usize,
+        j_start: usize, j_end: usize,
+        l_start: usize, l_end: usize,
+        k: usize, n: usize,
+    ) !void {
+        _ = self;
+        
+        const VecSize = if (@import("builtin").cpu.arch == .x86_64) 8 else 4;
+        
+        var i = i_start;
+        while (i < i_end) : (i += 1) {
+            var j = j_start;
+            
+            // Vectorized inner loop
+            while (j + VecSize <= j_end) : (j += VecSize) {
+                var sum_vec: @Vector(VecSize, f32) = @splat(0.0);
+                
+                var l = l_start;
+                while (l < l_end) : (l += 1) {
+                    const a_val: @Vector(VecSize, f32) = @splat(a[i * k + l]);
+                    const b_vals: @Vector(VecSize, f32) = b[l * n + j..l * n + j + VecSize][0..VecSize].*;
+                    sum_vec = @mulAdd(@Vector(VecSize, f32), a_val, b_vals, sum_vec);
+                }
+                
+                c[i * n + j..i * n + j + VecSize][0..VecSize].* = sum_vec;
+            }
+            
+            // Handle remainder
+            while (j < j_end) : (j += 1) {
+                var sum: f32 = 0.0;
+                var l = l_start;
+                while (l < l_end) : (l += 1) {
+                    sum += a[i * k + l] * b[l * n + j];
+                }
+                c[i * n + j] = sum;
+            }
+        }
+    }
+    
+    /// Optimized RMS normalization
+    pub fn rmsNorm(
+        self: *Self,
+        input: []const f32,
+        weight: []const f32,
+        output: []f32,
+        eps: f32,
+    ) !void {
+        _ = self;
+        
+        const VecSize = if (@import("builtin").cpu.arch == .x86_64) 8 else 4;
+        const vec_len = input.len / VecSize * VecSize;
+        
+        // Compute mean square using SIMD
+        var sum_squares: @Vector(VecSize, f32) = @splat(0.0);
+        var i: usize = 0;
+        while (i < vec_len) : (i += VecSize) {
+            const x: @Vector(VecSize, f32) = input[i..i+VecSize][0..VecSize].*;
+            sum_squares = @mulAdd(@Vector(VecSize, f32), x, x, sum_squares);
+        }
+        
+        // Sum vector elements
+        var mean_square: f32 = 0.0;
+        for (0..VecSize) |j| {
+            mean_square += sum_squares[j];
+        }
+        
+        // Handle remainder
+        while (i < input.len) : (i += 1) {
+            mean_square += input[i] * input[i];
+        }
+        
+        mean_square /= @floatFromInt(input.len);
+        
+        // Normalize
+        const rms = @sqrt(mean_square + eps);
+        const rms_vec: @Vector(VecSize, f32) = @splat(rms);
+        
+        i = 0;
+        while (i < vec_len) : (i += VecSize) {
+            const x: @Vector(VecSize, f32) = input[i..i+VecSize][0..VecSize].*;
+            const w: @Vector(VecSize, f32) = weight[i..i+VecSize][0..VecSize].*;
+            const normalized = (x / rms_vec) * w;
+            output[i..i+VecSize][0..VecSize].* = normalized;
+        }
+        
+        // Handle remainder
+        while (i < input.len) : (i += 1) {
+            output[i] = (input[i] / rms) * weight[i];
+        }
+    }
+    
+    /// SwiGLU activation function with SIMD
+    pub fn swiglu(
+        self: *Self,
+        input: []const f32,
+        gate: []const f32,
+        output: []f32,
+    ) !void {
+        _ = self;
+        
+        const VecSize = if (@import("builtin").cpu.arch == .x86_64) 8 else 4;
+        const vec_len = input.len / VecSize * VecSize;
+        
+        var i: usize = 0;
+        while (i < vec_len) : (i += VecSize) {
+            const x: @Vector(VecSize, f32) = input[i..i+VecSize][0..VecSize].*;
+            const g: @Vector(VecSize, f32) = gate[i..i+VecSize][0..VecSize].*;
+            
+            // SwiGLU: x * (g / (1 + exp(-g)))
+            const ones: @Vector(VecSize, f32) = @splat(1.0);
+            const swish_g = g / (ones + @exp(-g));
+            const result = x * swish_g;
+            
+            output[i..i+VecSize][0..VecSize].* = result;
+        }
+        
+        // Handle remainder
+        while (i < input.len) : (i += 1) {
+            const g_val = gate[i];
+            const swish_val = g_val / (1.0 + @exp(-g_val));
+            output[i] = input[i] * swish_val;
+        }
+    }
+};
+
+/// Create the backend interface
+pub fn init(allocator: Allocator) !deepseek_core.Backend {
+    // For now, return a simple backend struct
+    // In a full implementation, this would create a CpuBackend and wrap it
+    return deepseek_core.Backend.init(allocator, .cpu, 0);
+}
+
+/// Detect CPU capabilities at runtime
+fn detectCapabilities() deepseek_core.Backend.Capabilities {
+    const arch = @import("builtin").cpu.arch;
+    
+    return switch (arch) {
+        .x86_64 => .{
+            .supports_fp16 = true,
+            .supports_bf16 = true, // Check for AVX-512 BF16 in real implementation
+            .supports_int8 = true,
+            .max_memory_gb = 128,
+            .compute_capability = null,
+            .simd_width = 8, // AVX2
+        },
+        .aarch64 => .{
+            .supports_fp16 = true,
+            .supports_bf16 = true, // ARM64 has native BF16 support
+            .supports_int8 = true,
+            .max_memory_gb = 96,
+            .compute_capability = null,
+            .simd_width = 4, // NEON 128-bit
+        },
+        else => .{
+            .supports_fp16 = false,
+            .supports_bf16 = false,
+            .supports_int8 = true,
+            .max_memory_gb = 16,
+            .compute_capability = null,
+            .simd_width = 1,
+        },
+    };
+} 
--- a/experimental/src/backends/cuda/root.zig
+++ b/experimental/src/backends/cuda/root.zig
@ -0,0 +1,297 @@
+// CUDA Backend for DeepSeek V3
+// Optimized for NVIDIA GPUs with Tensor Cores and high-bandwidth memory
+
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const Allocator = std.mem.Allocator;
+
+/// CUDA backend implementation
+pub const CudaBackend = struct {
+    allocator: Allocator,
+    device_id: u32,
+    device_available: bool,
+    compute_capability: []const u8,
+    memory_gb: u32,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, device_id: u32) !Self {
+        // Check if CUDA is available at runtime
+        const cuda_available = detectCudaRuntime();
+        
+        if (cuda_available) {
+            std.log.info("CUDA Backend initialized on device {d}", .{device_id});
+            // TODO: Initialize CUDA context and device
+            // TODO: Query device properties
+        } else {
+            std.log.warn("CUDA Backend not available - no CUDA runtime detected");
+        }
+        
+        return Self{
+            .allocator = allocator,
+            .device_id = device_id,
+            .device_available = cuda_available,
+            .compute_capability = if (cuda_available) "8.0" else "0.0", // H100 default
+            .memory_gb = if (cuda_available) 80 else 0, // H100 80GB
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        if (self.device_available) {
+            // TODO: Cleanup CUDA context and memory
+            std.log.debug("Cleaning up CUDA device {d}", .{self.device_id});
+        }
+    }
+    
+    /// Matrix multiplication using cuBLAS/Tensor Cores
+    pub fn matmul(
+        self: *Self,
+        a: *deepseek_core.Tensor,
+        b: *const deepseek_core.Tensor,
+        c: *deepseek_core.Tensor,
+    ) !void {
+        if (!self.device_available) {
+            return error.CudaNotAvailable;
+        }
+        
+        std.log.debug("CUDA matmul on device {d}: {}x{} * {}x{} -> {}x{}", .{
+            self.device_id,
+            a.shape.dims[0], a.shape.dims[1],
+            b.shape.dims[0], b.shape.dims[1], 
+            c.shape.dims[0], c.shape.dims[1]
+        });
+        
+        // TODO: Implement CUDA matrix multiplication
+        // This would involve:
+        // 1. Allocate GPU memory with cudaMalloc
+        // 2. Copy data to GPU with cudaMemcpy
+        // 3. Call cuBLAS gemm or custom Tensor Core kernel
+        // 4. Copy results back to host
+        // 5. Free GPU memory
+        
+        return error.NotImplemented;
+    }
+    
+    /// RMS normalization using custom CUDA kernel
+    pub fn rmsNorm(
+        self: *Self,
+        input: []const f32,
+        weight: []const f32,
+        output: []f32,
+        eps: f32,
+    ) !void {
+        if (!self.device_available) {
+            return error.CudaNotAvailable;
+        }
+        
+        _ = input;
+        _ = weight;
+        _ = output;
+        _ = eps;
+        
+        std.log.debug("CUDA RMS normalization on device {d}", .{self.device_id});
+        
+        // TODO: Launch CUDA kernel for RMS normalization
+        // GPU excels at parallel reduction and normalization
+        
+        return error.NotImplemented;
+    }
+    
+    /// SwiGLU activation using CUDA
+    pub fn swiglu(
+        self: *Self,
+        input: []const f32,
+        gate: []const f32,
+        output: []f32,
+    ) !void {
+        if (!self.device_available) {
+            return error.CudaNotAvailable;
+        }
+        
+        _ = input;
+        _ = gate;
+        _ = output;
+        
+        std.log.debug("CUDA SwiGLU activation on device {d}", .{self.device_id});
+        
+        // TODO: Launch CUDA kernel for SwiGLU
+        // Element-wise operations are perfect for GPU parallelization
+        
+        return error.NotImplemented;
+    }
+    
+    /// Optimized attention with flash attention
+    pub fn flashAttention(
+        self: *Self,
+        query: *deepseek_core.Tensor,
+        key: *const deepseek_core.Tensor,
+        value: *const deepseek_core.Tensor,
+        output: *deepseek_core.Tensor,
+    ) !void {
+        if (!self.device_available) {
+            return error.CudaNotAvailable;
+        }
+        
+        _ = query;
+        _ = key;
+        _ = value;
+        _ = output;
+        
+        std.log.debug("CUDA Flash Attention on device {d}", .{self.device_id});
+        
+        // TODO: Implement Flash Attention algorithm
+        // This provides memory-efficient attention for long sequences
+        // Critical for DeepSeek V3's 32K context window
+        
+        return error.NotImplemented;
+    }
+    
+    /// Check GPU memory usage
+    pub fn getMemoryInfo(self: *Self) struct { free: u64, total: u64, used: u64 } {
+        if (!self.device_available) {
+            return .{ .free = 0, .total = 0, .used = 0 };
+        }
+        
+        // TODO: Call cudaMemGetInfo to get actual memory usage
+        const total = @as(u64, self.memory_gb) * 1024 * 1024 * 1024;
+        return .{ 
+            .free = total, // TODO: Get actual free memory
+            .total = total,
+            .used = 0, // TODO: Calculate used memory
+        };
+    }
+    
+    /// Synchronize device (wait for all operations to complete)
+    pub fn synchronize(self: *Self) !void {
+        if (!self.device_available) {
+            return;
+        }
+        
+        // TODO: Call cudaDeviceSynchronize()
+        std.log.debug("Synchronizing CUDA device {d}", .{self.device_id});
+    }
+};
+
+/// Create the CUDA backend interface
+pub fn init(allocator: Allocator) !deepseek_core.Backend {
+    // For now, return a simple backend struct
+    // In a full implementation, this would create a CudaBackend and wrap it
+    return deepseek_core.Backend.init(allocator, .cuda, 0);
+}
+
+/// Detect CUDA runtime availability
+fn detectCudaRuntime() bool {
+    // TODO: Check for CUDA library availability
+    // This would involve trying to load libcuda.so/cuda.dll
+    // and checking for basic CUDA functions
+    return false; // Disabled for now
+}
+
+/// CUDA kernel templates (would be compiled with nvcc)
+const cuda_kernels = struct {
+    // Matrix multiplication kernel using Tensor Cores
+    const matmul_kernel = 
+        \\__global__ void matmul_kernel(
+        \\    const float* __restrict__ a,
+        \\    const float* __restrict__ b,
+        \\    float* __restrict__ c,
+        \\    int M, int N, int K
+        \\) {
+        \\    // Use Tensor Cores for mixed precision
+        \\    // This would use wmma API for Tensor Core acceleration
+        \\    int row = blockIdx.y * blockDim.y + threadIdx.y;
+        \\    int col = blockIdx.x * blockDim.x + threadIdx.x;
+        \\    
+        \\    if (row < M && col < N) {
+        \\        float sum = 0.0f;
+        \\        for (int k = 0; k < K; k++) {
+        \\            sum += a[row * K + k] * b[k * N + col];
+        \\        }
+        \\        c[row * N + col] = sum;
+        \\    }
+        \\}
+    ;
+    
+    // RMS normalization kernel with warp-level reduction
+    const rms_norm_kernel = 
+        \\__global__ void rms_norm_kernel(
+        \\    const float* __restrict__ input,
+        \\    const float* __restrict__ weight,
+        \\    float* __restrict__ output,
+        \\    int size,
+        \\    float eps
+        \\) {
+        \\    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+        \\    
+        \\    // Compute mean square using cooperative groups
+        \\    __shared__ float shared_sum[32]; // For warp reduction
+        \\    
+        \\    float thread_sum = 0.0f;
+        \\    for (int i = tid; i < size; i += gridDim.x * blockDim.x) {
+        \\        thread_sum += input[i] * input[i];
+        \\    }
+        \\    
+        \\    // Warp-level reduction
+        \\    for (int mask = 16; mask > 0; mask /= 2) {
+        \\        thread_sum += __shfl_down_sync(0xffffffff, thread_sum, mask);
+        \\    }
+        \\    
+        \\    if (threadIdx.x % 32 == 0) {
+        \\        shared_sum[threadIdx.x / 32] = thread_sum;
+        \\    }
+        \\    __syncthreads();
+        \\    
+        \\    // Final reduction and normalization
+        \\    if (threadIdx.x == 0) {
+        \\        float mean_square = 0.0f;
+        \\        for (int i = 0; i < blockDim.x / 32; i++) {
+        \\            mean_square += shared_sum[i];
+        \\        }
+        \\        mean_square /= size;
+        \\        float rms = sqrtf(mean_square + eps);
+        \\        
+        \\        // Store in shared memory for other threads
+        \\        shared_sum[0] = rms;
+        \\    }
+        \\    __syncthreads();
+        \\    
+        \\    float rms = shared_sum[0];
+        \\    if (tid < size) {
+        \\        output[tid] = (input[tid] / rms) * weight[tid];
+        \\    }
+        \\}
+    ;
+    
+    // SwiGLU activation kernel
+    const swiglu_kernel = 
+        \\__global__ void swiglu_kernel(
+        \\    const float* __restrict__ input,
+        \\    const float* __restrict__ gate,
+        \\    float* __restrict__ output,
+        \\    int size
+        \\) {
+        \\    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+        \\    
+        \\    if (tid < size) {
+        \\        float g = gate[tid];
+        \\        float swish_g = g / (1.0f + expf(-g));
+        \\        output[tid] = input[tid] * swish_g;
+        \\    }
+        \\}
+    ;
+};
+
+/// CUDA device capabilities
+fn getCudaCapabilities(compute_capability: []const u8) deepseek_core.Backend.Capabilities {
+    // Parse compute capability (e.g., "8.0" for H100)
+    const major = std.fmt.parseInt(u8, compute_capability[0..1], 10) catch 0;
+    
+    return .{
+        .supports_fp16 = major >= 6,  // Pascal and newer
+        .supports_bf16 = major >= 8,  // Ampere and newer  
+        .supports_int8 = major >= 6,  // Pascal and newer
+        .max_memory_gb = if (major >= 8) 80 else 24, // H100 vs V100
+        .compute_capability = compute_capability,
+        .simd_width = 32, // CUDA warp size
+    };
+} 
--- a/experimental/src/backends/metal/root.zig
+++ b/experimental/src/backends/metal/root.zig
@ -0,0 +1,230 @@
+// Metal Backend for DeepSeek V3 on Apple Silicon
+// Leverages Metal Performance Shaders and unified memory architecture
+
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const Allocator = std.mem.Allocator;
+
+/// Metal backend implementation for Apple Silicon
+pub const MetalBackend = struct {
+    allocator: Allocator,
+    device_available: bool,
+    unified_memory_size: u64,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator) !Self {
+        // Check if Metal is available (compile-time check for macOS)
+        const metal_available = @import("builtin").os.tag == .macos;
+        
+        if (metal_available) {
+            std.log.info("Metal Backend initialized on Apple Silicon");
+            // TODO: Initialize MTLDevice and command queue
+            // TODO: Query unified memory size
+        } else {
+            std.log.warn("Metal Backend not available on this platform");
+        }
+        
+        return Self{
+            .allocator = allocator,
+            .device_available = metal_available,
+            .unified_memory_size = if (metal_available) 16 * 1024 * 1024 * 1024 else 0, // 16GB default
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        // TODO: Release Metal resources
+        _ = self;
+    }
+    
+    /// Matrix multiplication using Metal Performance Shaders
+    pub fn matmul(
+        self: *Self,
+        a: *deepseek_core.Tensor,
+        b: *const deepseek_core.Tensor,
+        c: *deepseek_core.Tensor,
+    ) !void {
+        if (!self.device_available) {
+            return error.MetalNotAvailable;
+        }
+        
+        std.log.debug("Metal matmul: {}x{} * {}x{} -> {}x{}", .{
+            a.shape.dims[0], a.shape.dims[1],
+            b.shape.dims[0], b.shape.dims[1], 
+            c.shape.dims[0], c.shape.dims[1]
+        });
+        
+        // TODO: Implement actual Metal compute shader
+        // This would involve:
+        // 1. Create MTLBuffer from tensor data
+        // 2. Set up compute pipeline with matmul shader
+        // 3. Dispatch compute commands
+        // 4. Copy results back to tensor
+        
+        // For now, fallback to CPU implementation
+        return error.NotImplemented;
+    }
+    
+    /// RMS normalization using Metal compute shader
+    pub fn rmsNorm(
+        self: *Self,
+        input: []const f32,
+        weight: []const f32,
+        output: []f32,
+        eps: f32,
+    ) !void {
+        if (!self.device_available) {
+            return error.MetalNotAvailable;
+        }
+        
+        _ = input;
+        _ = weight;
+        _ = output;
+        _ = eps;
+        
+        std.log.debug("Metal RMS normalization");
+        
+        // TODO: Implement Metal compute shader for RMS norm
+        // Metal excels at parallel operations like normalization
+        
+        return error.NotImplemented;
+    }
+    
+    /// SwiGLU activation using Metal
+    pub fn swiglu(
+        self: *Self,
+        input: []const f32,
+        gate: []const f32,
+        output: []f32,
+    ) !void {
+        if (!self.device_available) {
+            return error.MetalNotAvailable;
+        }
+        
+        _ = input;
+        _ = gate;
+        _ = output;
+        
+        std.log.debug("Metal SwiGLU activation");
+        
+        // TODO: Implement Metal compute shader for SwiGLU
+        // GPU is perfect for element-wise operations like activations
+        
+        return error.NotImplemented;
+    }
+    
+    /// Attention mechanism optimized for Apple Silicon
+    pub fn attention(
+        self: *Self,
+        query: *deepseek_core.Tensor,
+        key: *const deepseek_core.Tensor,
+        value: *const deepseek_core.Tensor,
+        output: *deepseek_core.Tensor,
+    ) !void {
+        if (!self.device_available) {
+            return error.MetalNotAvailable;
+        }
+        
+        _ = query;
+        _ = key;
+        _ = value;
+        _ = output;
+        
+        std.log.debug("Metal attention mechanism");
+        
+        // TODO: Implement optimized attention for Apple Silicon
+        // This would leverage:
+        // - Unified memory for zero-copy operations
+        // - Metal Performance Shaders for optimized GEMM
+        // - Custom shaders for attention-specific operations
+        
+        return error.NotImplemented;
+    }
+    
+    /// Check GPU memory usage
+    pub fn getMemoryInfo(self: *Self) struct { used: u64, total: u64 } {
+        if (!self.device_available) {
+            return .{ .used = 0, .total = 0 };
+        }
+        
+        // TODO: Query actual Metal device memory usage
+        return .{ 
+            .used = 0, // TODO: Get current usage
+            .total = self.unified_memory_size,
+        };
+    }
+};
+
+/// Create the Metal backend interface
+pub fn init(allocator: Allocator) !deepseek_core.Backend {
+    // For now, return a simple backend struct
+    // In a full implementation, this would create a MetalBackend and wrap it
+    return deepseek_core.Backend.init(allocator, .metal, 0);
+}
+
+/// Metal compute shader templates (would be loaded from .metal files)
+const metal_shaders = struct {
+    // Matrix multiplication shader (simplified)
+    const matmul_shader = 
+        \\#include <metal_stdlib>
+        \\using namespace metal;
+        \\
+        \\kernel void matmul_kernel(
+        \\    device const float* a [[buffer(0)]],
+        \\    device const float* b [[buffer(1)]],
+        \\    device float* c [[buffer(2)]],
+        \\    constant uint& M [[buffer(3)]],
+        \\    constant uint& N [[buffer(4)]],
+        \\    constant uint& K [[buffer(5)]],
+        \\    uint2 gid [[thread_position_in_grid]]
+        \\) {
+        \\    if (gid.x >= N || gid.y >= M) return;
+        \\    
+        \\    float sum = 0.0;
+        \\    for (uint k = 0; k < K; k++) {
+        \\        sum += a[gid.y * K + k] * b[k * N + gid.x];
+        \\    }
+        \\    c[gid.y * N + gid.x] = sum;
+        \\}
+    ;
+    
+    // RMS normalization shader
+    const rms_norm_shader = 
+        \\#include <metal_stdlib>
+        \\using namespace metal;
+        \\
+        \\kernel void rms_norm_kernel(
+        \\    device const float* input [[buffer(0)]],
+        \\    device const float* weight [[buffer(1)]],
+        \\    device float* output [[buffer(2)]],
+        \\    constant uint& size [[buffer(3)]],
+        \\    constant float& eps [[buffer(4)]],
+        \\    uint gid [[thread_position_in_grid]]
+        \\) {
+        \\    // Simplified RMS norm - would need proper reduction
+        \\    if (gid >= size) return;
+        \\    
+        \\    // TODO: Implement proper parallel reduction for mean square
+        \\    float mean_square = 0.0;
+        \\    for (uint i = 0; i < size; i++) {
+        \\        mean_square += input[i] * input[i];
+        \\    }
+        \\    mean_square /= size;
+        \\    
+        \\    float rms = sqrt(mean_square + eps);
+        \\    output[gid] = (input[gid] / rms) * weight[gid];
+        \\}
+    ;
+};
+
+/// Capabilities for Apple Silicon
+fn getAppleSiliconCapabilities() deepseek_core.Backend.Capabilities {
+    return .{
+        .supports_fp16 = true,  // Native fp16 support
+        .supports_bf16 = true,  // M3+ supports bf16
+        .supports_int8 = true,  // Efficient int8 operations
+        .max_memory_gb = 128,   // Up to 128GB unified memory on Mac Studio
+        .compute_capability = null,
+        .simd_width = 32,       // Metal SIMD-group size
+    };
+} 
--- a/experimental/src/core/attention.zig
+++ b/experimental/src/core/attention.zig
@ -0,0 +1,14 @@
+const std = @import("std");
+
+/// Multi-Head Latent Attention (MLA) for DeepSeek V3
+pub const Attention = struct {
+    // TODO: Implement MLA attention mechanism
+    
+    pub fn init() Attention {
+        return Attention{};
+    }
+    
+    pub fn deinit(self: *Attention) void {
+        _ = self;
+    }
+}; 
--- a/experimental/src/core/backend.zig
+++ b/experimental/src/core/backend.zig
@ -0,0 +1,88 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+
+/// Backend types supported by DeepSeek V3
+pub const BackendType = enum {
+    cpu,
+    metal,
+    cuda,
+    webgpu,
+};
+
+/// Backend capabilities
+pub const Capabilities = struct {
+    supports_fp16: bool,
+    supports_bf16: bool,
+    supports_int8: bool,
+    max_memory_gb: u32,
+    compute_capability: ?[]const u8, // For CUDA
+    simd_width: u32,
+};
+
+/// Backend interface for different compute backends
+pub const Backend = struct {
+    type: BackendType,
+    device_id: u32,
+    allocator: Allocator,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, backend_type: BackendType, device_id: u32) Self {
+        return Self{
+            .type = backend_type,
+            .device_id = device_id,
+            .allocator = allocator,
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        // TODO: Backend-specific cleanup
+        _ = self;
+    }
+    
+    pub fn capabilities(self: *const Self) Capabilities {
+        return switch (self.type) {
+            .cpu => Capabilities{
+                .supports_fp16 = true,
+                .supports_bf16 = true,
+                .supports_int8 = true,
+                .max_memory_gb = 128, // Typical system RAM
+                .compute_capability = null,
+                .simd_width = if (@import("builtin").cpu.arch == .x86_64) 8 else 4,
+            },
+            .metal => Capabilities{
+                .supports_fp16 = true,
+                .supports_bf16 = true,
+                .supports_int8 = true,
+                .max_memory_gb = 96, // Apple Silicon unified memory
+                .compute_capability = null,
+                .simd_width = 16, // Metal SIMD groups
+            },
+            .cuda => Capabilities{
+                .supports_fp16 = true,
+                .supports_bf16 = true,
+                .supports_int8 = true,
+                .max_memory_gb = 80, // H100 VRAM
+                .compute_capability = "8.0", // TODO: Detect actual capability
+                .simd_width = 32, // CUDA warp size
+            },
+            .webgpu => Capabilities{
+                .supports_fp16 = false, // Limited support
+                .supports_bf16 = false,
+                .supports_int8 = false,
+                .max_memory_gb = 4, // Browser limitations
+                .compute_capability = null,
+                .simd_width = 1,
+            },
+        };
+    }
+    
+    pub fn name(self: *const Self) []const u8 {
+        return switch (self.type) {
+            .cpu => "CPU",
+            .metal => "Metal",
+            .cuda => "CUDA",
+            .webgpu => "WebGPU",
+        };
+    }
+}; 
--- a/experimental/src/core/config.zig
+++ b/experimental/src/core/config.zig
@ -0,0 +1,13 @@
+const std = @import("std");
+
+/// Global configuration for DeepSeek V3
+pub const Config = struct {
+    log_level: std.log.Level = .info,
+    enable_telemetry: bool = false,
+    cache_dir: ?[]const u8 = null,
+    
+    pub fn loadFromEnv() Config {
+        // TODO: Load configuration from environment variables
+        return Config{};
+    }
+}; 
--- a/experimental/src/core/math/activation.zig
+++ b/experimental/src/core/math/activation.zig
@ -0,0 +1,33 @@
+const std = @import("std");
+
+/// SwiGLU activation function used in DeepSeek V3
+pub fn swiglu(x: f32, gate: f32) f32 {
+    return x * swish(gate);
+}
+
+/// Swish activation (SiLU)
+pub fn swish(x: f32) f32 {
+    return x / (1.0 + @exp(-x));
+}
+
+/// GELU activation
+pub fn gelu(x: f32) f32 {
+    const tanh_arg = 0.7978845608 * (x + 0.044715 * x * x * x);
+    return 0.5 * x * (1.0 + std.math.tanh(tanh_arg));
+}
+
+/// ReLU activation
+pub fn relu(x: f32) f32 {
+    return @max(0.0, x);
+}
+
+/// Vectorized SwiGLU for SIMD
+pub fn swigluVec(comptime size: comptime_int, x: @Vector(size, f32), gate: @Vector(size, f32)) @Vector(size, f32) {
+    return x * swishVec(size, gate);
+}
+
+/// Vectorized Swish for SIMD
+pub fn swishVec(comptime size: comptime_int, x: @Vector(size, f32)) @Vector(size, f32) {
+    const ones: @Vector(size, f32) = @splat(1.0);
+    return x / (ones + @exp(-x));
+} 
--- a/experimental/src/core/math/rms_norm.zig
+++ b/experimental/src/core/math/rms_norm.zig
@ -0,0 +1,64 @@
+const std = @import("std");
+
+/// RMS Normalization used in DeepSeek V3
+pub fn rmsNorm(input: []const f32, weight: []const f32, output: []f32, eps: f32) void {
+    std.debug.assert(input.len == weight.len);
+    std.debug.assert(input.len == output.len);
+    
+    // Compute mean square
+    var mean_square: f32 = 0.0;
+    for (input) |x| {
+        mean_square += x * x;
+    }
+    mean_square /= @floatFromInt(input.len);
+    
+    // Compute RMS and normalize
+    const rms = @sqrt(mean_square + eps);
+    for (0..input.len) |i| {
+        output[i] = (input[i] / rms) * weight[i];
+    }
+}
+
+/// Vectorized RMS normalization for better performance
+pub fn rmsNormVec(input: []const f32, weight: []const f32, output: []f32, eps: f32) void {
+    const VecSize = 8;
+    const vec_len = input.len / VecSize * VecSize;
+    
+    // Compute mean square using SIMD
+    var sum_squares: @Vector(VecSize, f32) = @splat(0.0);
+    var i: usize = 0;
+    while (i < vec_len) : (i += VecSize) {
+        const x: @Vector(VecSize, f32) = input[i..i+VecSize][0..VecSize].*;
+        sum_squares += x * x;
+    }
+    
+    // Sum the vector elements
+    var mean_square: f32 = 0.0;
+    for (0..VecSize) |j| {
+        mean_square += sum_squares[j];
+    }
+    
+    // Handle remainder
+    while (i < input.len) : (i += 1) {
+        mean_square += input[i] * input[i];
+    }
+    
+    mean_square /= @floatFromInt(input.len);
+    
+    // Normalize using SIMD
+    const rms = @sqrt(mean_square + eps);
+    const rms_vec: @Vector(VecSize, f32) = @splat(rms);
+    
+    i = 0;
+    while (i < vec_len) : (i += VecSize) {
+        const x: @Vector(VecSize, f32) = input[i..i+VecSize][0..VecSize].*;
+        const w: @Vector(VecSize, f32) = weight[i..i+VecSize][0..VecSize].*;
+        const normalized = (x / rms_vec) * w;
+        output[i..i+VecSize][0..VecSize].* = normalized;
+    }
+    
+    // Handle remainder
+    while (i < input.len) : (i += 1) {
+        output[i] = (input[i] / rms) * weight[i];
+    }
+} 
--- a/experimental/src/core/math/root.zig
+++ b/experimental/src/core/math/root.zig
@ -0,0 +1,13 @@
+const std = @import("std");
+
+// Math utilities for DeepSeek V3
+pub const simd = @import("simd.zig");
+pub const activation = @import("activation.zig");
+pub const rms_norm = @import("rms_norm.zig");
+
+// Re-export common math functions
+pub const sqrt = std.math.sqrt;
+pub const exp = std.math.exp;
+pub const tanh = std.math.tanh;
+pub const sin = std.math.sin;
+pub const cos = std.math.cos; 
--- a/experimental/src/core/math/simd.zig
+++ b/experimental/src/core/math/simd.zig
@ -0,0 +1,25 @@
+const std = @import("std");
+
+/// SIMD utilities for high-performance computation
+pub fn vectorAdd(comptime T: type, comptime size: comptime_int, a: @Vector(size, T), b: @Vector(size, T)) @Vector(size, T) {
+    return a + b;
+}
+
+pub fn vectorMul(comptime T: type, comptime size: comptime_int, a: @Vector(size, T), b: @Vector(size, T)) @Vector(size, T) {
+    return a * b;
+}
+
+pub fn vectorFma(comptime T: type, comptime size: comptime_int, a: @Vector(size, T), b: @Vector(size, T), c: @Vector(size, T)) @Vector(size, T) {
+    return @mulAdd(@Vector(size, T), a, b, c);
+}
+
+/// Horizontal sum of vector elements
+pub fn horizontalSum(comptime T: type, comptime size: comptime_int, vec: @Vector(size, T)) T {
+    if (size == 1) return vec[0];
+    
+    var result: T = 0;
+    for (0..size) |i| {
+        result += vec[i];
+    }
+    return result;
+} 
--- a/experimental/src/core/memory.zig
+++ b/experimental/src/core/memory.zig
@ -0,0 +1,35 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+
+/// Arena allocator for request-scoped memory
+pub const ArenaAllocator = std.heap.ArenaAllocator;
+
+/// Memory pool for tensor allocations
+pub const TensorPool = struct {
+    allocator: Allocator,
+    pool: std.ArrayList([]u8),
+    
+    pub fn init(allocator: Allocator) TensorPool {
+        return TensorPool{
+            .allocator = allocator,
+            .pool = std.ArrayList([]u8).init(allocator),
+        };
+    }
+    
+    pub fn deinit(self: *TensorPool) void {
+        for (self.pool.items) |buf| {
+            self.allocator.free(buf);
+        }
+        self.pool.deinit();
+    }
+    
+    pub fn alloc(self: *TensorPool, size: usize) ![]u8 {
+        // TODO: Implement memory pooling
+        return try self.allocator.alloc(u8, size);
+    }
+    
+    pub fn free(self: *TensorPool, buf: []u8) void {
+        // TODO: Return to pool instead of freeing
+        self.allocator.free(buf);
+    }
+}; 
--- a/experimental/src/core/model.zig
+++ b/experimental/src/core/model.zig
@ -0,0 +1,296 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+const Tensor = @import("tensor.zig").Tensor;
+const Shape = @import("tensor.zig").Shape;
+const Transformer = @import("transformer.zig").Transformer;
+const Tokenizer = @import("tokenizer.zig").Tokenizer;
+const Backend = @import("backend.zig").Backend;
+const CoreError = @import("root.zig").CoreError;
+
+pub const ModelError = CoreError || error{
+    InvalidModelFile,
+    UnsupportedModelVersion,
+    CorruptedWeights,
+    MissingTokenizer,
+};
+
+/// Model configuration matching DeepSeek V3 architecture
+pub const ModelConfig = struct {
+    // Model dimensions
+    vocab_size: u32,
+    hidden_size: u32,
+    intermediate_size: u32,
+    num_hidden_layers: u32,
+    num_attention_heads: u32,
+    num_key_value_heads: u32,
+    max_position_embeddings: u32,
+    
+    // MoE configuration
+    num_experts: u32,
+    num_experts_per_token: u32,
+    expert_capacity: u32,
+    
+    // Multi-head Latent Attention (MLA) config
+    qk_nope_head_dim: u32,
+    qk_rope_head_dim: u32,
+    v_head_dim: u32,
+    qk_rope_base: f32,
+    
+    // Activation function
+    hidden_act: []const u8, // "swiglu" for DeepSeek V3
+    
+    // Normalization
+    rms_norm_eps: f32,
+    
+    // Quantization settings
+    use_fp16: bool,
+    use_bf16: bool,
+    
+    pub fn deepseekV3Default() ModelConfig {
+        return ModelConfig{
+            .vocab_size = 129280,
+            .hidden_size = 7168,
+            .intermediate_size = 18432,
+            .num_hidden_layers = 61,
+            .num_attention_heads = 128,
+            .num_key_value_heads = 128,
+            .max_position_embeddings = 32768,
+            .num_experts = 256,
+            .num_experts_per_token = 8,
+            .expert_capacity = 64,
+            .qk_nope_head_dim = 128,
+            .qk_rope_head_dim = 64,
+            .v_head_dim = 128,
+            .qk_rope_base = 10000.0,
+            .hidden_act = "swiglu",
+            .rms_norm_eps = 1e-6,
+            .use_fp16 = false,
+            .use_bf16 = true,
+        };
+    }
+};
+
+/// Model information
+pub const ModelInfo = struct {
+    name: []const u8,
+    version: []const u8,
+    config: ModelConfig,
+    num_parameters: u64,
+    memory_usage: u64,
+};
+
+/// DeepSeek V3 Model
+pub const Model = struct {
+    config: ModelConfig,
+    transformer: Transformer,
+    tokenizer: Tokenizer,
+    backend: Backend,
+    allocator: Allocator,
+    
+    // Embedding layers
+    embed_tokens: Tensor,
+    embed_positions: ?Tensor,
+    
+    // Output layers
+    lm_head: Tensor,
+    norm: Tensor,
+    
+    const Self = @This();
+    
+    /// Load model from file path
+    pub fn loadFromPath(allocator: Allocator, path: []const u8, backend: Backend) !Self {
+        std.log.info("Loading DeepSeek V3 model from: {s}", .{path});
+        
+        // TODO: Implement model loading from file
+        // For now, create a default model
+        return loadDefault(allocator, backend);
+    }
+    
+    /// Load default/demo model
+    pub fn loadDefault(allocator: Allocator, backend: Backend) !Self {
+        const config = ModelConfig.deepseekV3Default();
+        
+        std.log.info("Creating default DeepSeek V3 model...", .{});
+        std.log.info("  Hidden size: {}", .{config.hidden_size});
+        std.log.info("  Layers: {}", .{config.num_hidden_layers});
+        std.log.info("  Experts: {}", .{config.num_experts});
+        std.log.info("  Vocab size: {}", .{config.vocab_size});
+        
+        // Initialize transformer
+        const transformer = try Transformer.init(allocator, config, backend);
+        
+        // Initialize tokenizer
+        const tokenizer = try Tokenizer.init(allocator, config.vocab_size);
+        
+        // Initialize embedding layers
+        const embed_shape = Shape.init(&[_]u32{ config.vocab_size, config.hidden_size });
+        var embed_tokens = try Tensor.init(allocator, embed_shape, .f32);
+        
+        // Initialize with random values (in real implementation, load from weights)
+        try initializeEmbedding(&embed_tokens);
+        
+        // Output projection
+        const lm_head_shape = Shape.init(&[_]u32{ config.hidden_size, config.vocab_size });
+        var lm_head = try Tensor.init(allocator, lm_head_shape, .f32);
+        try initializeLinear(&lm_head);
+        
+        // Final layer norm
+        const norm_shape = Shape.init(&[_]u32{config.hidden_size});
+        const norm = try Tensor.ones(allocator, norm_shape, .f32);
+        
+        return Self{
+            .config = config,
+            .transformer = transformer,
+            .tokenizer = tokenizer,
+            .backend = backend,
+            .allocator = allocator,
+            .embed_tokens = embed_tokens,
+            .embed_positions = null,
+            .lm_head = lm_head,
+            .norm = norm,
+        };
+    }
+    
+    /// Free model memory
+    pub fn deinit(self: *Self) void {
+        self.transformer.deinit();
+        self.tokenizer.deinit();
+        self.embed_tokens.deinit();
+        if (self.embed_positions) |*pos| pos.deinit();
+        self.lm_head.deinit();
+        self.norm.deinit();
+    }
+    
+    /// Get model information
+    pub fn info(self: *const Self) ModelInfo {
+        const num_params = self.estimateParameters();
+        const memory_usage = self.estimateMemoryUsage();
+        
+        return ModelInfo{
+            .name = "DeepSeek V3",
+            .version = "0.1.0",
+            .config = self.config,
+            .num_parameters = num_params,
+            .memory_usage = memory_usage,
+        };
+    }
+    
+    /// Generate text completion
+    pub fn generate(self: *Self, input_tokens: []const u32, max_tokens: u32) ![]u32 {
+        _ = self;
+        _ = input_tokens;
+        _ = max_tokens;
+        
+        // TODO: Implement actual generation
+        // This would involve:
+        // 1. Run forward pass through transformer layers
+        // 2. Apply final layer norm and output projection
+        // 3. Sample next token from logits
+        // 4. Repeat until max_tokens or EOS
+        
+        std.log.debug("Generation not yet implemented");
+        return error.NotImplemented;
+    }
+    
+    /// Forward pass through the model
+    pub fn forward(
+        self: *Self,
+        input_ids: []const u32,
+        output: *Tensor,
+    ) !void {
+        // TODO: Implement forward pass
+        // 1. Embedding lookup
+        // 2. Transformer forward pass
+        // 3. Final layer norm
+        // 4. Language model head
+        
+        _ = self;
+        _ = input_ids;
+        _ = output;
+        
+        std.log.debug("Model forward pass (placeholder)");
+    }
+    
+    /// Estimate model parameters
+    fn estimateParameters(self: *const Self) u64 {
+        var params: u64 = 0;
+        
+        // Embedding parameters
+        params += @as(u64, self.config.vocab_size) * self.config.hidden_size;
+        
+        // Transformer parameters (rough estimate)
+        const layer_params = @as(u64, self.config.hidden_size) * self.config.hidden_size * 4; // Attention + FFN
+        params += layer_params * self.config.num_hidden_layers;
+        
+        // MoE parameters
+        const expert_params = @as(u64, self.config.hidden_size) * self.config.intermediate_size * 2;
+        params += expert_params * self.config.num_experts;
+        
+        // Output head
+        params += @as(u64, self.config.hidden_size) * self.config.vocab_size;
+        
+        return params;
+    }
+    
+    /// Estimate memory usage in bytes
+    fn estimateMemoryUsage(self: *const Self) u64 {
+        const params = self.estimateParameters();
+        const dtype_size: u64 = if (self.config.use_fp16 or self.config.use_bf16) 2 else 4;
+        
+        // Model weights + activation memory + KV cache
+        return params * dtype_size * 2; // Rough estimate
+    }
+};
+
+// Initialize embedding with small random values
+fn initializeEmbedding(tensor: *Tensor) !void {
+    const data = try tensor.asSliceF32();
+    var rng = std.Random.DefaultPrng.init(42);
+    const random = rng.random();
+    
+    for (data) |*val| {
+        val.* = (random.float(f32) - 0.5) * 0.02; // Small random values
+    }
+}
+
+// Initialize linear layer with Xavier initialization
+fn initializeLinear(tensor: *Tensor) !void {
+    const data = try tensor.asSliceF32();
+    var rng = std.Random.DefaultPrng.init(123);
+    const random = rng.random();
+    
+    const fan_in = tensor.shape.dims[0];
+    const fan_out = tensor.shape.dims[1];
+    const limit = std.math.sqrt(6.0 / @as(f32, @floatFromInt(fan_in + fan_out)));
+    
+    for (data) |*val| {
+        val.* = (random.float(f32) - 0.5) * 2.0 * limit;
+    }
+}
+
+// Tests
+test "model creation" {
+    const testing = std.testing;
+    const allocator = testing.allocator;
+    
+    // Create a dummy backend for testing
+    const backend = Backend{
+        .type = .cpu,
+        .device_id = 0,
+        .allocator = allocator,
+    };
+    
+    var model = try Model.loadDefault(allocator, backend);
+    defer model.deinit();
+    
+    const model_info = model.info();
+    try testing.expect(model_info.num_parameters > 0);
+    try testing.expect(std.mem.eql(u8, model_info.name, "DeepSeek V3"));
+}
+
+test "model config" {
+    const config = ModelConfig.deepseekV3Default();
+    std.testing.expect(config.vocab_size == 129280) catch unreachable;
+    std.testing.expect(config.num_experts == 256) catch unreachable;
+    std.testing.expect(config.num_experts_per_token == 8) catch unreachable;
+} 
--- a/experimental/src/core/moe.zig
+++ b/experimental/src/core/moe.zig
@ -0,0 +1,14 @@
+const std = @import("std");
+
+/// Mixture of Experts implementation for DeepSeek V3
+pub const MoE = struct {
+    // TODO: Implement MoE routing and expert selection
+    
+    pub fn init() MoE {
+        return MoE{};
+    }
+    
+    pub fn deinit(self: *MoE) void {
+        _ = self;
+    }
+};
--- a/experimental/src/core/root.zig
+++ b/experimental/src/core/root.zig
@ -0,0 +1,61 @@
+// DeepSeek V3 Core Module
+// This module contains the fundamental components for LLM inference
+
+const std = @import("std");
+
+// Core components
+pub const Tensor = @import("tensor.zig").Tensor;
+pub const Model = @import("model.zig").Model;
+pub const Transformer = @import("transformer.zig").Transformer;
+pub const Attention = @import("attention.zig").Attention;
+pub const MoE = @import("moe.zig").MoE;
+pub const Tokenizer = @import("tokenizer.zig").Tokenizer;
+pub const Backend = @import("backend.zig").Backend;
+
+// Math utilities
+pub const math = @import("math/root.zig");
+
+// Memory management
+pub const memory = @import("memory.zig");
+
+// Configuration
+pub const Config = @import("config.zig").Config;
+
+// Error types
+pub const CoreError = error{
+    InvalidTensorShape,
+    UnsupportedOperation,
+    ModelLoadError,
+    TokenizerError,
+    BackendError,
+    OutOfMemory,
+    InvalidConfiguration,
+};
+
+// Version information
+pub const version = struct {
+    pub const major = 0;
+    pub const minor = 1;
+    pub const patch = 0;
+    pub const string = "0.1.0";
+};
+
+// Core test suite
+test "core module" {
+    const testing = std.testing;
+    
+    // Basic smoke tests
+    try testing.expect(version.major == 0);
+    try testing.expect(version.minor == 1);
+}
+
+// Utility functions
+pub fn init() void {
+    // TODO: Initialize any global state if needed
+    std.log.info("DeepSeek V3 Core initialized (v{s})", .{version.string});
+}
+
+pub fn deinit() void {
+    // TODO: Cleanup any global state
+    std.log.info("DeepSeek V3 Core deinitialized");
+} 
--- a/experimental/src/core/tensor.zig
+++ b/experimental/src/core/tensor.zig
@ -0,0 +1,312 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+const CoreError = @import("root.zig").CoreError;
+
+pub const TensorError = CoreError || error{
+    ShapeMismatch,
+    InvalidDimension,
+    BufferTooSmall,
+};
+
+/// Shape of a tensor - maximum 8 dimensions for DeepSeek V3
+pub const Shape = struct {
+    dims: [8]u32,
+    ndim: u8,
+    
+    pub fn init(dimensions: []const u32) Shape {
+        var shape = Shape{
+            .dims = [_]u32{0} ** 8,
+            .ndim = @intCast(dimensions.len),
+        };
+        for (dimensions, 0..) |dim, i| {
+            shape.dims[i] = dim;
+        }
+        return shape;
+    }
+    
+    pub fn numel(self: Shape) u64 {
+        var total: u64 = 1;
+        for (0..self.ndim) |i| {
+            total *= self.dims[i];
+        }
+        return total;
+    }
+    
+    pub fn equals(self: Shape, other: Shape) bool {
+        if (self.ndim != other.ndim) return false;
+        for (0..self.ndim) |i| {
+            if (self.dims[i] != other.dims[i]) return false;
+        }
+        return true;
+    }
+    
+    pub fn format(
+        self: Shape,
+        comptime fmt: []const u8,
+        options: std.fmt.FormatOptions,
+        writer: anytype,
+    ) !void {
+        _ = fmt;
+        _ = options;
+        try writer.print("Shape([");
+        for (0..self.ndim) |i| {
+            if (i > 0) try writer.print(", ");
+            try writer.print("{}", .{self.dims[i]});
+        }
+        try writer.print("])");
+    }
+};
+
+/// Tensor data type
+pub const DType = enum {
+    f32,
+    f16,
+    bf16,
+    i32,
+    u32,
+    i8,
+    u8,
+    
+    pub fn size(self: DType) u8 {
+        return switch (self) {
+            .f32, .i32, .u32 => 4,
+            .f16, .bf16 => 2,
+            .i8, .u8 => 1,
+        };
+    }
+};
+
+/// Multi-dimensional tensor with SIMD optimizations
+pub const Tensor = struct {
+    data: []u8,
+    shape: Shape,
+    dtype: DType,
+    allocator: Allocator,
+    
+    const Self = @This();
+    
+    /// Create a new tensor with given shape and data type
+    pub fn init(allocator: Allocator, shape: Shape, dtype: DType) !Self {
+        const size = shape.numel() * dtype.size();
+        const data = try allocator.alloc(u8, size);
+        @memset(data, 0);
+        
+        return Self{
+            .data = data,
+            .shape = shape,
+            .dtype = dtype,
+            .allocator = allocator,
+        };
+    }
+    
+    /// Create tensor from existing data (takes ownership)
+    pub fn fromData(allocator: Allocator, data: []u8, shape: Shape, dtype: DType) !Self {
+        const expected_size = shape.numel() * dtype.size();
+        if (data.len != expected_size) {
+            return TensorError.BufferTooSmall;
+        }
+        
+        return Self{
+            .data = data,
+            .shape = shape,
+            .dtype = dtype,
+            .allocator = allocator,
+        };
+    }
+    
+    /// Create tensor filled with zeros
+    pub fn zeros(allocator: Allocator, shape: Shape, dtype: DType) !Self {
+        return init(allocator, shape, dtype);
+    }
+    
+    /// Create tensor filled with ones
+    pub fn ones(allocator: Allocator, shape: Shape, dtype: DType) !Self {
+        var tensor = try init(allocator, shape, dtype);
+        try tensor.fill(1.0);
+        return tensor;
+    }
+    
+    /// Free tensor memory
+    pub fn deinit(self: *Self) void {
+        self.allocator.free(self.data);
+    }
+    
+    /// Fill tensor with a scalar value
+    pub fn fill(self: *Self, value: f32) !void {
+        switch (self.dtype) {
+            .f32 => {
+                const data_f32 = @as([]f32, @alignCast(std.mem.bytesAsSlice(f32, self.data)));
+                @memset(data_f32, value);
+            },
+            .f16 => {
+                const data_f16 = @as([]f16, @alignCast(std.mem.bytesAsSlice(f16, self.data)));
+                @memset(data_f16, @floatCast(value));
+            },
+            .i32 => {
+                const data_i32 = @as([]i32, @alignCast(std.mem.bytesAsSlice(i32, self.data)));
+                @memset(data_i32, @intFromFloat(value));
+            },
+            else => return TensorError.UnsupportedOperation,
+        }
+    }
+    
+    /// Get tensor as typed slice (f32)
+    pub fn asSliceF32(self: *Self) ![]f32 {
+        if (self.dtype != .f32) return TensorError.UnsupportedOperation;
+        return @as([]f32, @alignCast(std.mem.bytesAsSlice(f32, self.data)));
+    }
+    
+    /// Get tensor as typed slice (f16)
+    pub fn asSliceF16(self: *Self) ![]f16 {
+        if (self.dtype != .f16) return TensorError.UnsupportedOperation;
+        return @as([]f16, @alignCast(std.mem.bytesAsSlice(f16, self.data)));
+    }
+    
+    /// Element-wise addition (SIMD optimized)
+    pub fn add(self: *Self, other: *const Self, result: *Self) !void {
+        if (!self.shape.equals(other.shape) or !self.shape.equals(result.shape)) {
+            return TensorError.ShapeMismatch;
+        }
+        if (self.dtype != other.dtype or self.dtype != result.dtype) {
+            return TensorError.UnsupportedOperation;
+        }
+        
+        switch (self.dtype) {
+            .f32 => try addF32SIMD(self.data, other.data, result.data),
+            .f16 => try addF16(self.data, other.data, result.data),
+            else => return TensorError.UnsupportedOperation,
+        }
+    }
+    
+    /// Matrix multiplication (optimized for transformers)
+    pub fn matmul(self: *Self, other: *const Self, result: *Self) !void {
+        if (self.shape.ndim != 2 or other.shape.ndim != 2 or result.shape.ndim != 2) {
+            return TensorError.InvalidDimension;
+        }
+        
+        const m = self.shape.dims[0];
+        const k = self.shape.dims[1];
+        const n = other.shape.dims[1];
+        
+        if (other.shape.dims[0] != k or result.shape.dims[0] != m or result.shape.dims[1] != n) {
+            return TensorError.ShapeMismatch;
+        }
+        
+        switch (self.dtype) {
+            .f32 => try matmulF32(self, other, result),
+            else => return TensorError.UnsupportedOperation,
+        }
+    }
+    
+    pub fn format(
+        self: Self,
+        comptime fmt: []const u8,
+        options: std.fmt.FormatOptions,
+        writer: anytype,
+    ) !void {
+        _ = fmt;
+        _ = options;
+        try writer.print("Tensor({}, {})", .{ self.shape, @tagName(self.dtype) });
+    }
+};
+
+// SIMD optimized addition for f32
+fn addF32SIMD(a: []const u8, b: []const u8, result: []u8) !void {
+    const a_f32 = @as([]const f32, @alignCast(std.mem.bytesAsSlice(f32, a)));
+    const b_f32 = @as([]const f32, @alignCast(std.mem.bytesAsSlice(f32, b)));
+    const result_f32 = @as([]f32, @alignCast(std.mem.bytesAsSlice(f32, result)));
+    
+    const VecSize = 8; // AVX2 can process 8 f32s at once
+    const vec_len = a_f32.len / VecSize * VecSize;
+    
+    // SIMD loop
+    var i: usize = 0;
+    while (i < vec_len) : (i += VecSize) {
+        const va: @Vector(VecSize, f32) = a_f32[i..i+VecSize][0..VecSize].*;
+        const vb: @Vector(VecSize, f32) = b_f32[i..i+VecSize][0..VecSize].*;
+        const vr = va + vb;
+        result_f32[i..i+VecSize][0..VecSize].* = vr;
+    }
+    
+    // Handle remainder
+    while (i < a_f32.len) : (i += 1) {
+        result_f32[i] = a_f32[i] + b_f32[i];
+    }
+}
+
+// Basic f16 addition (can be optimized with ARM NEON)
+fn addF16(a: []const u8, b: []const u8, result: []u8) !void {
+    const a_f16 = @as([]const f16, @alignCast(std.mem.bytesAsSlice(f16, a)));
+    const b_f16 = @as([]const f16, @alignCast(std.mem.bytesAsSlice(f16, b)));
+    const result_f16 = @as([]f16, @alignCast(std.mem.bytesAsSlice(f16, result)));
+    
+    for (0..a_f16.len) |i| {
+        result_f16[i] = a_f16[i] + b_f16[i];
+    }
+}
+
+// Optimized matrix multiplication for transformers
+fn matmulF32(a: *Tensor, b: *const Tensor, c: *Tensor) !void {
+    const a_data = try a.asSliceF32();
+    const b_data = @as([]const f32, @alignCast(std.mem.bytesAsSlice(f32, b.data)));
+    const c_data = try c.asSliceF32();
+    
+    const m = a.shape.dims[0];
+    const k = a.shape.dims[1];
+    const n = b.shape.dims[1];
+    
+    // TODO: Implement blocked matrix multiplication with SIMD
+    // For now, simple triple loop
+    for (0..m) |i| {
+        for (0..n) |j| {
+            var sum: f32 = 0.0;
+            for (0..k) |l| {
+                sum += a_data[i * k + l] * b_data[l * n + j];
+            }
+            c_data[i * n + j] = sum;
+        }
+    }
+}
+
+// Tests
+test "tensor creation and basic operations" {
+    const testing = std.testing;
+    const allocator = testing.allocator;
+    
+    // Test tensor creation
+    const shape = Shape.init(&[_]u32{2, 3});
+    var tensor = try Tensor.zeros(allocator, shape, .f32);
+    defer tensor.deinit();
+    
+    try testing.expect(tensor.shape.numel() == 6);
+    try testing.expect(tensor.dtype == .f32);
+    
+    // Test fill
+    try tensor.fill(5.0);
+    const data = try tensor.asSliceF32();
+    try testing.expect(data[0] == 5.0);
+    try testing.expect(data[5] == 5.0);
+}
+
+test "tensor addition" {
+    const testing = std.testing;
+    const allocator = testing.allocator;
+    
+    const shape = Shape.init(&[_]u32{4});
+    var a = try Tensor.ones(allocator, shape, .f32);
+    defer a.deinit();
+    
+    var b = try Tensor.ones(allocator, shape, .f32);
+    defer b.deinit();
+    try b.fill(2.0);
+    
+    var result = try Tensor.zeros(allocator, shape, .f32);
+    defer result.deinit();
+    
+    try a.add(&b, &result);
+    
+    const data = try result.asSliceF32();
+    for (data) |val| {
+        try testing.expect(val == 3.0);
+    }
+} 
--- a/experimental/src/core/tokenizer.zig
+++ b/experimental/src/core/tokenizer.zig
@ -0,0 +1,43 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+
+/// Tokenizer for DeepSeek V3
+pub const Tokenizer = struct {
+    vocab_size: u32,
+    allocator: Allocator,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, vocab_size: u32) !Self {
+        std.log.info("Initializing tokenizer with vocab size: {}", .{vocab_size});
+        
+        return Self{
+            .vocab_size = vocab_size,
+            .allocator = allocator,
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        _ = self;
+        // TODO: Cleanup tokenizer resources
+    }
+    
+    pub fn encode(self: *Self, text: []const u8) ![]u32 {
+        // TODO: Implement actual tokenization
+        _ = text;
+        
+        // For now, return dummy tokens
+        const tokens = try self.allocator.alloc(u32, 5);
+        for (0..tokens.len) |i| {
+            tokens[i] = @intCast(i + 1);
+        }
+        return tokens;
+    }
+    
+    pub fn decode(self: *Self, tokens: []const u32) ![]u8 {
+        // TODO: Implement actual detokenization
+        _ = tokens;
+        
+        return try self.allocator.dupe(u8, "Hello, world!");
+    }
+}; 
--- a/experimental/src/core/transformer.zig
+++ b/experimental/src/core/transformer.zig
@ -0,0 +1,40 @@
+const std = @import("std");
+const Allocator = std.mem.Allocator;
+const Tensor = @import("tensor.zig").Tensor;
+const Backend = @import("backend.zig").Backend;
+const model = @import("model.zig");
+
+/// DeepSeek V3 Transformer implementation
+pub const Transformer = struct {
+    config: model.ModelConfig,
+    backend: Backend,
+    allocator: Allocator,
+    
+    // TODO: Add transformer layers
+    // layers: []TransformerLayer,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, config: model.ModelConfig, backend: Backend) !Self {
+        // TODO: Initialize transformer layers
+        std.log.info("Initializing Transformer with {} layers", .{config.num_hidden_layers});
+        
+        return Self{
+            .config = config,
+            .backend = backend,
+            .allocator = allocator,
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        // TODO: Cleanup layers
+        _ = self;
+    }
+    
+    pub fn forward(self: *Self, input: *Tensor, output: *Tensor) !void {
+        // TODO: Implement transformer forward pass
+        _ = self;
+        _ = input;
+        _ = output;
+    }
+}; 
--- a/experimental/src/main.zig
+++ b/experimental/src/main.zig
@ -0,0 +1,132 @@
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const web_layer = @import("web_layer");
+const cpu_backend = @import("cpu_backend");
+const metal_backend = @import("metal_backend");
+const cuda_backend = @import("cuda_backend");
+
+const print = std.debug.print;
+const Allocator = std.mem.Allocator;
+
+const Config = struct {
+    port: u16 = 8080,
+    host: []const u8 = "127.0.0.1",
+    model_path: ?[]const u8 = null,
+    backend: Backend = .cpu,
+    max_concurrent_requests: u32 = 100,
+    max_sequence_length: u32 = 32768,
+    
+    const Backend = enum {
+        cpu,
+        metal,
+        cuda,
+        webgpu,
+    };
+};
+
+pub fn main() !void {
+    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
+    defer _ = gpa.deinit();
+    const allocator = gpa.allocator();
+
+    // Parse command line arguments
+    const config = try parseArgs(allocator);
+    
+    // Initialize the selected backend
+    var backend = try initBackend(allocator, config.backend);
+    defer backend.deinit();
+    
+    // Load the model
+    var model = if (config.model_path) |path|
+        try deepseek_core.Model.loadFromPath(allocator, path, backend)
+    else
+        try deepseek_core.Model.loadDefault(allocator, backend);
+    defer model.deinit();
+    
+    print("🚀 DeepZig V3 Server Starting...\n", .{});
+    print("   Backend: {s}\n", .{@tagName(config.backend)});
+    print("   Host: {s}:{d}\n", .{ config.host, config.port });
+    print("   Model: {s}\n", .{model.info().name});
+    print("   Max Context: {} tokens\n", .{config.max_sequence_length});
+    
+    // Start the web server
+    var server = try web_layer.Server.init(allocator, .{
+        .host = config.host,
+        .port = config.port,
+        .model = model,
+        .max_concurrent_requests = config.max_concurrent_requests,
+    });
+    defer server.deinit();
+    
+    print("✅ Server ready! Send requests to http://{s}:{d}\n", .{ config.host, config.port });
+    print("   Endpoints:\n", .{});
+    print("   - POST /v1/chat/completions (OpenAI compatible)\n", .{});
+    print("   - POST /v1/completions\n", .{});
+    print("   - GET  /v1/models\n", .{});
+    print("   - GET  /health\n", .{});
+    print("   - WebSocket /ws (streaming)\n", .{});
+    
+    try server.listen();
+}
+
+fn parseArgs(allocator: Allocator) !Config {
+    const args = try std.process.argsAlloc(allocator);
+    defer std.process.argsFree(allocator, args);
+    
+    var config = Config{};
+    
+    var i: usize = 1;
+    while (i < args.len) : (i += 1) {
+        const arg = args[i];
+        
+        if (std.mem.eql(u8, arg, "--port") and i + 1 < args.len) {
+            config.port = try std.fmt.parseInt(u16, args[i + 1], 10);
+            i += 1;
+        } else if (std.mem.eql(u8, arg, "--host") and i + 1 < args.len) {
+            config.host = args[i + 1];
+            i += 1;
+        } else if (std.mem.eql(u8, arg, "--model") and i + 1 < args.len) {
+            config.model_path = args[i + 1];
+            i += 1;
+        } else if (std.mem.eql(u8, arg, "--backend") and i + 1 < args.len) {
+            const backend_str = args[i + 1];
+            config.backend = std.meta.stringToEnum(Config.Backend, backend_str) orelse {
+                print("Unknown backend: {s}\n", .{backend_str});
+                print("Available backends: cpu, metal, cuda, webgpu\n", .{});
+                std.process.exit(1);
+            };
+            i += 1;
+        } else if (std.mem.eql(u8, arg, "--help") or std.mem.eql(u8, arg, "-h")) {
+            printHelp();
+            std.process.exit(0);
+        }
+    }
+    
+    return config;
+}
+
+fn initBackend(allocator: Allocator, backend_type: Config.Backend) !deepseek_core.Backend {
+    return switch (backend_type) {
+        .cpu => cpu_backend.init(allocator),
+        .metal => metal_backend.init(allocator),
+        .cuda => cuda_backend.init(allocator),
+        .webgpu => {
+            print("WebGPU backend not yet implemented, falling back to CPU\n", .{});
+            return cpu_backend.init(allocator);
+        },
+    };
+}
+
+fn printHelp() void {
+    print("DeepZig V3 - High-Performance LLM Inference\n\n", .{});
+    print("Usage: deepseek-v3-zig [OPTIONS]\n\n", .{});
+    print("Options:\n", .{});
+    print("  --port <PORT>        Port to listen on (default: 8080)\n", .{});
+    print("  --host <HOST>        Host to bind to (default: 127.0.0.1)\n", .{});
+    print("  --model <PATH>       Path to model weights\n", .{});
+    print("  --backend <BACKEND>  Backend to use: cpu, metal, cuda, webgpu (default: cpu)\n", .{});
+    print("  --help, -h           Show this help message\n\n", .{});
+    print("Examples:\n", .{});
+    print("  deepseek-v3-zig --port 3000 --backend metal\n", .{});
+    print("  deepseek-v3-zig --model ./models/deepseek-v3.bin --backend cuda\n", .{});
+} 
--- a/experimental/src/wasm/main.zig
+++ b/experimental/src/wasm/main.zig
@ -0,0 +1,127 @@
+// WebAssembly Entry Point for DeepSeek V3
+// Enables browser-based inference with minimal dependencies
+
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+
+// WebAssembly allocator using the heap
+var gpa = std.heap.GeneralPurposeAllocator(.{}){};
+const allocator = gpa.allocator();
+
+/// WebAssembly exports for JavaScript interop
+/// These functions are callable from JavaScript
+
+/// Initialize the model (exported to JS)
+export fn wasm_init_model() i32 {
+    // TODO: Initialize a smaller model suitable for browser
+    std.log.info("Initializing DeepSeek V3 for WebAssembly", .{});
+    
+    // For browser use, we'd use a much smaller model or quantized version
+    // Return success status
+    return 0; // Success
+}
+
+/// Generate text completion (exported to JS)
+export fn wasm_generate_text(
+    input_ptr: [*]const u8,
+    input_len: u32,
+    output_ptr: [*]u8,
+    output_max_len: u32,
+) u32 {
+    const input = input_ptr[0..input_len];
+    const output_buffer = output_ptr[0..output_max_len];
+    
+    std.log.info("WASM text generation: {s}", .{input});
+    
+    // TODO: Implement actual generation
+    // For now, return a placeholder response
+    const response = "Hello from DeepSeek V3 WASM! Input was: ";
+    const full_response = std.fmt.bufPrint(
+        output_buffer, 
+        "{s}{s}", 
+        .{ response, input }
+    ) catch {
+        // If buffer too small, return error length
+        return 0;
+    };
+    
+    return @intCast(full_response.len);
+}
+
+/// Tokenize text (exported to JS)
+export fn wasm_tokenize(
+    text_ptr: [*]const u8,
+    text_len: u32,
+    tokens_ptr: [*]u32,
+    max_tokens: u32,
+) u32 {
+    const text = text_ptr[0..text_len];
+    const tokens_buffer = tokens_ptr[0..max_tokens];
+    
+    // TODO: Implement actual tokenization
+    // For now, return dummy tokens
+    const token_count = @min(text.len / 4, max_tokens); // Rough estimate
+    
+    for (0..token_count) |i| {
+        tokens_buffer[i] = @intCast(i + 1000); // Dummy token IDs
+    }
+    
+    return @intCast(token_count);
+}
+
+/// Get model information (exported to JS)
+export fn wasm_get_model_info(
+    info_ptr: [*]u8,
+    info_max_len: u32,
+) u32 {
+    const info_buffer = info_ptr[0..info_max_len];
+    
+    const model_info = 
+        \\{"name":"DeepSeek-V3-WASM","version":"0.1.0","context_length":4096}
+    ;
+    
+    if (model_info.len > info_max_len) {
+        return 0; // Buffer too small
+    }
+    
+    @memcpy(info_buffer[0..model_info.len], model_info);
+    return @intCast(model_info.len);
+}
+
+/// Allocate memory for JavaScript (exported to JS)
+export fn wasm_alloc(size: u32) ?*anyopaque {
+    const bytes = allocator.alloc(u8, size) catch return null;
+    return bytes.ptr;
+}
+
+/// Free memory allocated by wasm_alloc (exported to JS)
+export fn wasm_free(ptr: ?*anyopaque, size: u32) void {
+    if (ptr) |p| {
+        const bytes: [*]u8 = @ptrCast(p);
+        allocator.free(bytes[0..size]);
+    }
+}
+
+/// Main entry point (called by Zig, not exported to JS)
+pub fn main() !void {
+    std.log.info("DeepSeek V3 WebAssembly module loaded", .{});
+    
+    // Initialize core components
+    deepseek_core.init();
+    
+    // WASM modules don't have a traditional main loop
+    // All interaction happens through exported functions
+}
+
+/// Panic handler for WebAssembly
+pub fn panic(message: []const u8, stack_trace: ?*std.builtin.StackTrace, ret_addr: ?usize) noreturn {
+    _ = stack_trace;
+    _ = ret_addr;
+    
+    // In WASM, we can't print to stderr normally
+    // Log the panic message and abort
+    std.log.err("WASM Panic: {s}", .{message});
+    
+    // Trap the WebAssembly execution
+    unreachable;
+}
--- a/experimental/src/web/handlers.zig
+++ b/experimental/src/web/handlers.zig
@ -0,0 +1,156 @@
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const openai = @import("openai.zig");
+
+const Allocator = std.mem.Allocator;
+const http = std.http;
+
+/// Handle chat completions endpoint (OpenAI compatible)
+pub fn chatCompletions(
+    allocator: Allocator,
+    model: *deepseek_core.Model,
+    request: *http.Server.Request,
+) !void {
+    _ = allocator;
+    _ = model;
+    
+    // For now, send a simple placeholder response
+    const response_json = 
+        \\{
+        \\  "id": "chatcmpl-123",
+        \\  "object": "chat.completion", 
+        \\  "created": 1677652288,
+        \\  "model": "deepzig-v3",
+        \\  "choices": [{
+        \\    "index": 0,
+        \\    "message": {
+        \\      "role": "assistant",
+        \\      "content": "Hello! This is a placeholder response from DeepZig V3."
+        \\    },
+        \\    "finish_reason": "stop"
+        \\  }],
+        \\  "usage": {
+        \\    "prompt_tokens": 10,
+        \\    "completion_tokens": 15,
+        \\    "total_tokens": 25
+        \\  }
+        \\}
+    ;
+    
+    try request.respond(response_json, .{
+        .extra_headers = &.{
+            .{ .name = "content-type", .value = "application/json" },
+        },
+    });
+}
+
+/// Handle text completions endpoint
+pub fn completions(
+    allocator: Allocator,
+    model: *deepseek_core.Model,
+    request: *http.Server.Request,
+) !void {
+    _ = allocator;
+    _ = model;
+    
+    try request.respond("Text completions not yet implemented", .{
+        .status = .not_implemented,
+    });
+}
+
+/// Handle models list endpoint
+pub fn models(
+    allocator: Allocator,
+    model: *deepseek_core.Model,
+    request: *http.Server.Request,
+) !void {
+    _ = allocator;
+    _ = model;
+    
+    const response_json = 
+        \\{
+        \\  "object": "list",
+        \\  "data": [{
+        \\    "id": "deepzig-v3",
+        \\    "object": "model",
+        \\    "created": 1677652288,
+        \\    "owned_by": "deepzig"
+        \\  }]
+        \\}
+    ;
+    
+    try request.respond(response_json, .{
+        .extra_headers = &.{
+            .{ .name = "content-type", .value = "application/json" },
+        },
+    });
+}
+
+/// Handle health check endpoint
+pub fn health(allocator: Allocator, request: *http.Server.Request) !void {
+    _ = allocator;
+    
+    const response_json = 
+        \\{
+        \\  "status": "healthy",
+        \\  "timestamp": 1677652288,
+        \\  "version": "0.1.0"
+        \\}
+    ;
+    
+    try request.respond(response_json, .{
+        .extra_headers = &.{
+            .{ .name = "content-type", .value = "application/json" },
+        },
+    });
+}
+
+/// Handle WebSocket endpoint
+pub fn websocket(
+    allocator: Allocator,
+    model: *deepseek_core.Model,
+    request: *http.Server.Request,
+) !void {
+    _ = allocator;
+    _ = model;
+    
+    try request.respond("WebSocket not yet implemented", .{
+        .status = .not_implemented,
+    });
+}
+
+/// Generate chat completion response (helper function)
+fn generateChatCompletion(
+    allocator: Allocator,
+    model: *deepseek_core.Model,
+    chat_request: openai.ChatCompletionRequest,
+) !*openai.ChatCompletionResponse {
+    // TODO: Implement actual generation
+    _ = model;
+    _ = chat_request;
+    
+    const response = try allocator.create(openai.ChatCompletionResponse);
+    response.* = openai.ChatCompletionResponse{
+        .id = "chatcmpl-123",
+        .object = "chat.completion",
+        .created = std.time.timestamp(),
+        .model = "deepzig-v3",
+        .choices = &[_]openai.Choice{
+            .{
+                .index = 0,
+                .message = openai.Message{
+                    .role = "assistant",
+                    .content = "Hello! This is a placeholder response from DeepZig V3.",
+                },
+                .finish_reason = "stop",
+            },
+        },
+        .usage = openai.Usage{
+            .prompt_tokens = 10,
+            .completion_tokens = 15,
+            .total_tokens = 25,
+        },
+    };
+    
+    return response;
+} 
--- a/experimental/src/web/middleware.zig
+++ b/experimental/src/web/middleware.zig
@ -0,0 +1,100 @@
+const std = @import("std");
+const http = std.http;
+const Allocator = std.mem.Allocator;
+
+/// CORS middleware configuration
+pub const CorsConfig = struct {
+    allow_origins: []const []const u8 = &[_][]const u8{"*"},
+    allow_methods: []const []const u8 = &[_][]const u8{"GET", "POST", "PUT", "DELETE", "OPTIONS"},
+    allow_headers: []const []const u8 = &[_][]const u8{"Content-Type", "Authorization"},
+    max_age: u32 = 86400, // 24 hours
+};
+
+/// Add CORS headers to response
+pub fn cors(response: *http.Server.Response, config: CorsConfig) !void {
+    _ = config;
+    // TODO: For now, just add basic CORS headers
+    // In a real implementation, you'd check the request origin against allowed origins
+    try response.headers.append("Access-Control-Allow-Origin", "*");
+    try response.headers.append("Access-Control-Allow-Methods", "GET, POST, PUT, DELETE, OPTIONS");
+    try response.headers.append("Access-Control-Allow-Headers", "Content-Type, Authorization");
+}
+
+/// Request logging middleware
+pub fn logRequest(response: *http.Server.Response) void {
+    const method = response.request.method;
+    const target = response.request.target;
+    const timestamp = std.time.timestamp();
+    
+    std.log.info("[{}] {s} {s}", .{ timestamp, @tagName(method), target });
+}
+
+/// Rate limiting middleware (basic implementation)
+pub const RateLimiter = struct {
+    requests: std.HashMap(u32, RequestCount, std.hash_map.DefaultContext(u32), std.hash_map.default_max_load_percentage),
+    allocator: Allocator,
+    max_requests: u32,
+    window_seconds: u32,
+    
+    const RequestCount = struct {
+        count: u32,
+        window_start: i64,
+    };
+    
+    pub fn init(allocator: Allocator, max_requests: u32, window_seconds: u32) RateLimiter {
+        return RateLimiter{
+            .requests = std.HashMap(u32, RequestCount, std.hash_map.DefaultContext(u32), std.hash_map.default_max_load_percentage).init(allocator),
+            .allocator = allocator,
+            .max_requests = max_requests,
+            .window_seconds = window_seconds,
+        };
+    }
+    
+    pub fn deinit(self: *RateLimiter) void {
+        self.requests.deinit();
+    }
+    
+    /// Check if request is allowed (simplified IP-based rate limiting)
+    pub fn checkRate(self: *RateLimiter, client_ip: u32) bool {
+        const now = std.time.timestamp();
+        const window_start = now - self.window_seconds;
+        
+        const result = self.requests.getOrPut(client_ip) catch return false;
+        
+        if (!result.found_existing) {
+            // New client
+            result.value_ptr.* = RequestCount{
+                .count = 1,
+                .window_start = now,
+            };
+            return true;
+        }
+        
+        // Check if we're in a new window
+        if (result.value_ptr.window_start < window_start) {
+            result.value_ptr.count = 1;
+            result.value_ptr.window_start = now;
+            return true;
+        }
+        
+        // Check if under limit
+        if (result.value_ptr.count < self.max_requests) {
+            result.value_ptr.count += 1;
+            return true;
+        }
+        
+        return false; // Rate limited
+    }
+};
+
+/// Authentication middleware (basic bearer token)
+pub fn authenticateBearer(response: *http.Server.Response, expected_token: []const u8) bool {
+    const auth_header = response.request.headers.getFirstValue("Authorization") orelse return false;
+    
+    if (!std.mem.startsWith(u8, auth_header, "Bearer ")) {
+        return false;
+    }
+    
+    const token = auth_header[7..]; // Skip "Bearer "
+    return std.mem.eql(u8, token, expected_token);
+} 
--- a/experimental/src/web/openai.zig
+++ b/experimental/src/web/openai.zig
@ -0,0 +1,57 @@
+const std = @import("std");
+
+// OpenAI API compatible structures
+
+/// Chat completion request
+pub const ChatCompletionRequest = struct {
+    model: []const u8,
+    messages: []Message,
+    max_tokens: ?u32 = null,
+    temperature: ?f32 = null,
+    top_p: ?f32 = null,
+    stream: ?bool = null,
+};
+
+/// Chat message
+pub const Message = struct {
+    role: []const u8, // "system", "user", "assistant"
+    content: []const u8,
+};
+
+/// Chat completion response
+pub const ChatCompletionResponse = struct {
+    id: []const u8,
+    object: []const u8, // "chat.completion"
+    created: i64,
+    model: []const u8,
+    choices: []Choice,
+    usage: Usage,
+};
+
+/// Choice in completion response
+pub const Choice = struct {
+    index: u32,
+    message: Message,
+    finish_reason: []const u8, // "stop", "length", "content_filter"
+};
+
+/// Token usage information
+pub const Usage = struct {
+    prompt_tokens: u32,
+    completion_tokens: u32,
+    total_tokens: u32,
+};
+
+/// Models list response
+pub const ModelsResponse = struct {
+    object: []const u8, // "list"
+    data: []ModelInfo,
+};
+
+/// Model information
+pub const ModelInfo = struct {
+    id: []const u8,
+    object: []const u8, // "model"
+    created: i64,
+    owned_by: []const u8,
+};
--- a/experimental/src/web/request.zig
+++ b/experimental/src/web/request.zig
@ -0,0 +1,75 @@
+const std = @import("std");
+const http = std.http;
+const Allocator = std.mem.Allocator;
+
+/// Request wrapper for easier handling
+pub const Request = struct {
+    inner: *http.Server.Request,
+    allocator: Allocator,
+    
+    const Self = @This();
+    
+    pub fn init(inner: *http.Server.Request, allocator: Allocator) Self {
+        return Self{
+            .inner = inner,
+            .allocator = allocator,
+        };
+    }
+    
+    /// Get request method
+    pub fn method(self: *const Self) http.Method {
+        return self.inner.method;
+    }
+    
+    /// Get request path/target
+    pub fn path(self: *const Self) []const u8 {
+        return self.inner.target;
+    }
+    
+    /// Get header value
+    pub fn header(self: *const Self, name: []const u8) ?[]const u8 {
+        return self.inner.headers.getFirstValue(name);
+    }
+    
+    /// Get query parameter (simple implementation)
+    pub fn query(self: *const Self, name: []const u8) ?[]const u8 {
+        const target = self.inner.target;
+        if (std.mem.indexOf(u8, target, "?")) |query_start| {
+            const query_string = target[query_start + 1..];
+            var iter = std.mem.split(u8, query_string, "&");
+            
+            while (iter.next()) |param| {
+                if (std.mem.indexOf(u8, param, "=")) |eq_pos| {
+                    const key = param[0..eq_pos];
+                    const value = param[eq_pos + 1..];
+                    if (std.mem.eql(u8, key, name)) {
+                        return value;
+                    }
+                }
+            }
+        }
+        return null;
+    }
+    
+    /// Extract path parameter (e.g., /users/{id} -> id value)
+    pub fn pathParam(self: *const Self, name: []const u8) ?[]const u8 {
+        // TODO: Implement proper path parameter extraction
+        // This would require route pattern matching
+        _ = self;
+        _ = name;
+        return null;
+    }
+    
+    /// Get content type
+    pub fn contentType(self: *const Self) ?[]const u8 {
+        return self.header("Content-Type");
+    }
+    
+    /// Check if request is JSON
+    pub fn isJson(self: *const Self) bool {
+        if (self.contentType()) |ct| {
+            return std.mem.startsWith(u8, ct, "application/json");
+        }
+        return false;
+    }
+}; 
--- a/experimental/src/web/response.zig
+++ b/experimental/src/web/response.zig
@ -0,0 +1,92 @@
+const std = @import("std");
+const http = std.http;
+const Allocator = std.mem.Allocator;
+
+/// Response wrapper for easier handling
+pub const Response = struct {
+    inner: *http.Server.Response,
+    allocator: Allocator,
+    
+    const Self = @This();
+    
+    pub fn init(inner: *http.Server.Response, allocator: Allocator) Self {
+        return Self{
+            .inner = inner,
+            .allocator = allocator,
+        };
+    }
+    
+    /// Set response status
+    pub fn setStatus(self: *Self, status: http.Status) void {
+        self.inner.status = status;
+    }
+    
+    /// Set header
+    pub fn setHeader(self: *Self, name: []const u8, value: []const u8) !void {
+        try self.inner.headers.append(name, value);
+    }
+    
+    /// Send JSON response
+    pub fn sendJson(self: *Self, data: anytype) !void {
+        const json_string = try std.json.stringifyAlloc(
+            self.allocator,
+            data,
+            .{ .whitespace = .indent_2 },
+        );
+        defer self.allocator.free(json_string);
+        
+        try self.setHeader("Content-Type", "application/json");
+        self.inner.transfer_encoding = .{ .content_length = json_string.len };
+        try self.inner.do();
+        
+        try self.inner.writeAll(json_string);
+        try self.inner.finish();
+    }
+    
+    /// Send text response
+    pub fn sendText(self: *Self, text: []const u8) !void {
+        try self.setHeader("Content-Type", "text/plain");
+        self.inner.transfer_encoding = .{ .content_length = text.len };
+        try self.inner.do();
+        
+        try self.inner.writeAll(text);
+        try self.inner.finish();
+    }
+    
+    /// Send HTML response
+    pub fn sendHtml(self: *Self, html: []const u8) !void {
+        try self.setHeader("Content-Type", "text/html");
+        self.inner.transfer_encoding = .{ .content_length = html.len };
+        try self.inner.do();
+        
+        try self.inner.writeAll(html);
+        try self.inner.finish();
+    }
+    
+    /// Send error response
+    pub fn sendError(self: *Self, status: http.Status, message: []const u8) !void {
+        const error_response = struct {
+            @"error": struct {
+                message: []const u8,
+                type: []const u8,
+                code: u16,
+            },
+        }{
+            .@"error" = .{
+                .message = message,
+                .type = "error",
+                .code = @intFromEnum(status),
+            },
+        };
+        
+        self.setStatus(status);
+        try self.sendJson(error_response);
+    }
+    
+    /// Redirect to another URL
+    pub fn redirect(self: *Self, location: []const u8) !void {
+        self.setStatus(.found);
+        try self.setHeader("Location", location);
+        try self.sendText("");
+    }
+}; 
--- a/experimental/src/web/root.zig
+++ b/experimental/src/web/root.zig
@ -0,0 +1,34 @@
+// DeepSeek V3 Web Layer
+// HTTP server and API endpoints
+
+const std = @import("std");
+
+// Web components
+pub const Server = @import("server.zig").Server;
+pub const handlers = @import("handlers.zig");
+pub const middleware = @import("middleware.zig");
+pub const websocket = @import("websocket.zig");
+
+// OpenAI API compatibility
+pub const openai = @import("openai.zig");
+
+// Response types
+pub const Response = @import("response.zig").Response;
+pub const Request = @import("request.zig").Request;
+
+// Error handling
+pub const WebError = error{
+    InvalidRequest,
+    Unauthorized,
+    RateLimited,
+    ServerError,
+    ModelNotFound,
+    BadRequest,
+};
+
+// Tests
+test "web layer" {
+    const testing = std.testing;
+    _ = testing;
+    // TODO: Add web layer tests
+} 
--- a/experimental/src/web/server.zig
+++ b/experimental/src/web/server.zig
@ -0,0 +1,239 @@
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+const handlers = @import("handlers.zig");
+const middleware = @import("middleware.zig");
+
+const Allocator = std.mem.Allocator;
+const net = std.net;
+const http = std.http;
+
+/// Server configuration
+pub const ServerConfig = struct {
+    host: []const u8,
+    port: u16,
+    model: deepseek_core.Model,
+    max_concurrent_requests: u32,
+    request_timeout_ms: u32 = 30000,
+    max_body_size: usize = 1024 * 1024, // 1MB
+};
+
+/// HTTP server for DeepSeek V3 API
+pub const Server = struct {
+    config: ServerConfig,
+    allocator: Allocator,
+    server: net.Server,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, config: ServerConfig) !Self {
+        const address = net.Address.parseIp4(config.host, config.port) catch |err| {
+            std.log.err("Failed to parse IP address {s}:{d}: {}", .{ config.host, config.port, err });
+            return err;
+        };
+        
+        const server = address.listen(.{}) catch |err| {
+            std.log.err("Failed to listen on {s}:{d}: {}", .{ config.host, config.port, err });
+            return err;
+        };
+        
+        return Self{
+            .config = config,
+            .allocator = allocator,
+            .server = server,
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        self.server.deinit();
+    }
+    
+    /// Start listening for requests
+    pub fn listen(self: *Self) !void {
+        std.log.info("Server listening on {s}:{d}", .{ self.config.host, self.config.port });
+        
+        while (true) {
+            // Accept connection
+            const connection = self.server.accept() catch |err| {
+                std.log.err("Failed to accept connection: {}", .{err});
+                continue;
+            };
+            defer connection.stream.close();
+            
+            // Handle request
+            self.handleConnection(connection) catch |err| {
+                std.log.err("Failed to handle connection: {}", .{err});
+                continue;
+            };
+        }
+    }
+    
+    /// Handle individual connection
+    fn handleConnection(self: *Self, connection: net.Server.Connection) !void {
+        var read_buffer: [4096]u8 = undefined;
+        var http_server = http.Server.init(connection, &read_buffer);
+        
+        // Receive request head
+        var request = http_server.receiveHead() catch |err| {
+            std.log.err("Failed to receive HTTP head: {}", .{err});
+            return;
+        };
+        
+        std.log.debug("Request: {s} {s}", .{ @tagName(request.head.method), request.head.target });
+        
+        // Route and handle request
+        try self.handleRequest(&request);
+    }
+    
+    /// Route and handle HTTP request
+    fn handleRequest(self: *Self, request: *http.Server.Request) !void {
+        const target = request.head.target;
+        
+        // Route requests based on path
+        if (std.mem.startsWith(u8, target, "/v1/chat/completions")) {
+            try self.handleChatCompletions(request);
+        } else if (std.mem.startsWith(u8, target, "/v1/completions")) {
+            try self.handleCompletions(request);
+        } else if (std.mem.startsWith(u8, target, "/v1/models")) {
+            try self.handleModels(request);
+        } else if (std.mem.startsWith(u8, target, "/health")) {
+            try self.handleHealth(request);
+        } else if (std.mem.startsWith(u8, target, "/ws")) {
+            try self.handleWebSocket(request);
+        } else {
+            try self.sendNotFound(request);
+        }
+    }
+    
+    /// Handle chat completions endpoint
+    fn handleChatCompletions(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        
+        // For now, send a simple placeholder response
+        const response_json = 
+            \\{
+            \\  "id": "chatcmpl-123",
+            \\  "object": "chat.completion", 
+            \\  "created": 1677652288,
+            \\  "model": "deepzig-v3",
+            \\  "choices": [{
+            \\    "index": 0,
+            \\    "message": {
+            \\      "role": "assistant",
+            \\      "content": "Hello! This is a placeholder response from DeepZig V3."
+            \\    },
+            \\    "finish_reason": "stop"
+            \\  }],
+            \\  "usage": {
+            \\    "prompt_tokens": 10,
+            \\    "completion_tokens": 15,
+            \\    "total_tokens": 25
+            \\  }
+            \\}
+        ;
+        
+        try request.respond(response_json, .{
+            .extra_headers = &.{
+                .{ .name = "content-type", .value = "application/json" },
+            },
+        });
+    }
+    
+    /// Handle text completions endpoint
+    fn handleCompletions(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        try request.respond("Text completions not yet implemented", .{
+            .status = .not_implemented,
+        });
+    }
+    
+    /// Handle models list endpoint
+    fn handleModels(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        
+        const response_json = 
+            \\{
+            \\  "object": "list",
+            \\  "data": [{
+            \\    "id": "deepzig-v3",
+            \\    "object": "model",
+            \\    "created": 1677652288,
+            \\    "owned_by": "deepzig"
+            \\  }]
+            \\}
+        ;
+        
+        try request.respond(response_json, .{
+            .extra_headers = &.{
+                .{ .name = "content-type", .value = "application/json" },
+            },
+        });
+    }
+    
+    /// Handle health check endpoint
+    fn handleHealth(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        
+        const response_json = 
+            \\{
+            \\  "status": "healthy",
+            \\  "timestamp": 1677652288,
+            \\  "version": "0.1.0"
+            \\}
+        ;
+        
+        try request.respond(response_json, .{
+            .extra_headers = &.{
+                .{ .name = "content-type", .value = "application/json" },
+            },
+        });
+    }
+    
+    /// Handle WebSocket endpoint (placeholder)
+    fn handleWebSocket(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        try request.respond("WebSocket not yet implemented", .{
+            .status = .not_implemented,
+        });
+    }
+    
+    /// Send 404 Not Found response
+    fn sendNotFound(self: *Self, request: *http.Server.Request) !void {
+        _ = self;
+        try request.respond("{\"error\":\"Not Found\"}", .{
+            .status = .not_found,
+            .extra_headers = &.{
+                .{ .name = "content-type", .value = "application/json" },
+            },
+        });
+    }
+};
+
+// Tests
+test "server creation" {
+    const testing = std.testing;
+    const allocator = testing.allocator;
+    
+    // Mock model for testing
+    const model = deepseek_core.Model{
+        .config = deepseek_core.Model.ModelConfig.deepseekV3Default(),
+        .transformer = undefined,
+        .tokenizer = undefined,
+        .backend = deepseek_core.Backend.init(allocator, .cpu, 0),
+        .allocator = allocator,
+        .embed_tokens = undefined,
+        .embed_positions = null,
+        .lm_head = undefined,
+        .norm = undefined,
+    };
+    
+    const config = ServerConfig{
+        .host = "127.0.0.1",
+        .port = 0, // Let OS choose port for testing
+        .model = model,
+        .max_concurrent_requests = 10,
+    };
+    
+    // Note: Can't actually create server in test due to socket binding
+    // This would require integration tests
+    _ = config;
+}
--- a/experimental/src/web/websocket.zig
+++ b/experimental/src/web/websocket.zig
@ -0,0 +1,102 @@
+const std = @import("std");
+const deepseek_core = @import("deepseek_core");
+
+const Allocator = std.mem.Allocator;
+
+/// WebSocket connection state
+pub const WebSocketState = enum {
+    connecting,
+    connected,
+    closing,
+    closed,
+};
+
+/// WebSocket frame types
+pub const FrameType = enum {
+    text,
+    binary,
+    close,
+    ping,
+    pong,
+};
+
+/// WebSocket connection handler
+pub const WebSocketConnection = struct {
+    allocator: Allocator,
+    state: WebSocketState,
+    model: *deepseek_core.Model,
+    
+    const Self = @This();
+    
+    pub fn init(allocator: Allocator, model: *deepseek_core.Model) Self {
+        return Self{
+            .allocator = allocator,
+            .state = .connecting,
+            .model = model,
+        };
+    }
+    
+    pub fn deinit(self: *Self) void {
+        self.state = .closed;
+    }
+    
+    /// Handle incoming WebSocket frame
+    pub fn handleFrame(self: *Self, frame_type: FrameType, data: []const u8) !void {
+        switch (frame_type) {
+            .text => try self.handleTextMessage(data),
+            .binary => try self.handleBinaryMessage(data),
+            .close => self.state = .closing,
+            .ping => try self.sendPong(data),
+            .pong => {}, // Handle pong if needed
+        }
+    }
+    
+    /// Handle text message (JSON chat requests)
+    fn handleTextMessage(self: *Self, data: []const u8) !void {
+        _ = self;
+        std.log.info("WebSocket text message: {s}", .{data});
+        
+        // TODO: Parse JSON chat request and stream response back
+        // This would involve:
+        // 1. Parse incoming JSON (chat completion request)
+        // 2. Start model generation
+        // 3. Stream tokens back as they're generated
+        // 4. Send completion when done
+    }
+    
+    /// Handle binary message
+    fn handleBinaryMessage(self: *Self, data: []const u8) !void {
+        _ = self;
+        _ = data;
+        std.log.info("WebSocket binary message received", .{});
+        // TODO: Handle binary data if needed
+    }
+    
+    /// Send pong response to ping
+    fn sendPong(self: *Self, data: []const u8) !void {
+        _ = self;
+        _ = data;
+        // TODO: Send WebSocket pong frame
+        std.log.debug("Sending WebSocket pong");
+    }
+    
+    /// Send text message to client
+    pub fn sendText(self: *Self, message: []const u8) !void {
+        _ = self;
+        // TODO: Implement WebSocket frame encoding and sending
+        std.log.debug("Sending WebSocket text: {s}", .{message});
+    }
+    
+    /// Send streaming token
+    pub fn sendStreamingToken(self: *Self, token: []const u8) !void {
+        // TODO: Format as Server-Sent Events style JSON and send
+        const json_chunk = try std.fmt.allocPrint(
+            self.allocator,
+            "{{\"choices\":[{{\"delta\":{{\"content\":\"{s}\"}}}}]}}",
+            .{token}
+        );
+        defer self.allocator.free(json_chunk);
+        
+        try self.sendText(json_chunk);
+    }
+}; 
--- a/experimental/zig-out/bin/deepseek-v3-zig
+++ b/experimental/zig-out/bin/deepseek-v3-zig