OktoSeek Documentation

Understanding AI Concepts with Code Examples

About OktoSeek

OktoSeek is a research center in Artificial Intelligence dedicated to making AI development accessible to everyone. We're building tools that simplify the creation of incredible things—from training custom models to deploying intelligent solutions.

Currently in version 1.0.20, OktoSeek IDE (OktoStudio) is under active development with improvements being implemented daily. Our mission is to bridge the gap between complex AI research and practical application, enabling developers, researchers, and educators to harness the power of AI without needing extensive technical expertise.

As a research center, we're constantly exploring new ways to make AI training more efficient, monitoring more insightful, and deployment more straightforward. Every feature in OktoSeek IDE is designed with one goal: to help you create amazing things with AI.

Whether you're training a model for the first time or managing complex research projects, OktoSeek provides the tools, insights, and support you need to succeed. We believe that when AI development becomes more accessible, innovation becomes more widespread.

Datasets: The Foundation of AI Training

Datasets are collections of data that teach your AI model. Think of them as textbooks for your AI—the better and more relevant the data, the smarter your model becomes. In OktoSeek IDE, you can create datasets from text, images, or even videos.

The IDE automatically merges multiple datasets, allowing you to combine different sources. You can also define percentages for each dataset, controlling how much each source contributes to training. This gives you fine-grained control over what your model learns.

    # Example: Dataset Merging in OktoSeek IDE
    # The IDE automatically handles this when you select multiple datasets:
    datasets = {
        "code_examples.csv": 60,   # 60% weight
        "documentation.txt": 30,   # 30% weight
        "comments.json": 10        # 10% weight
    }

    # IDE automatically:
    # 1. Loads all datasets
    # 2. Normalizes and merges them
    # 3. Applies the specified percentages
    # 4. Creates a balanced training set

Loss

Loss functions measure how well your AI model is performing. They calculate the difference between the model's predictions and the actual correct answers. Lower loss means better performance.

Common loss functions include Mean Squared Error (MSE) for regression tasks and Cross-Entropy Loss for classification tasks.

    # Example: Calculating Cross-Entropy Loss in Python
    import torch
    import torch.nn as nn

    # Model predictions (logits)
    predictions = torch.tensor([[2.0, 1.0, 0.1]])

    # True labels (class 0)
    targets = torch.tensor([0])

    # Calculate loss
    criterion = nn.CrossEntropyLoss()
    loss = criterion(predictions, targets)
    print(f"Loss: {loss.item():.4f}")
    # Output: Loss: 0.4170
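The example above covers the classification case. For regression, Mean Squared Error can be sketched in plain Python. The function name and the numbers below are made up for illustration and are not taken from OktoSeek IDE.

```python
# Mean Squared Error: the average of the squared differences between
# predictions and targets (illustrative values, not real model output)
def mean_squared_error(predictions, targets):
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

predictions = [2.5, 0.0, 2.0, 8.0]
targets = [3.0, -0.5, 2.0, 7.0]

mse = mean_squared_error(predictions, targets)
print(f"MSE: {mse:.4f}")  # MSE: 0.3750
```

Squaring the errors means one large miss (the last pair, off by 1.0) contributes more than several small ones.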

Perplexity

Perplexity measures how "surprised" your language model is by the data. Lower perplexity means the model assigns higher probability to the correct tokens. It is calculated as the exponential of the average per-token cross-entropy loss (using the natural logarithm).

For language models, a perplexity of 10 means the model is as uncertain as if it had to choose uniformly among 10 possibilities at each step.

    # Example: Calculating Perplexity
    import math

    # Loss from your model
    loss = 2.5

    # Perplexity = e^loss
    perplexity = math.exp(loss)
    print(f"Perplexity: {perplexity:.2f}")
    # Output: Perplexity: 12.18

    # Lower loss = Lower perplexity (better)
    # Loss 1.0 → Perplexity 2.72 (excellent)
    # Loss 3.0 → Perplexity 20.09 (needs improvement)
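The "choosing among 10 possibilities" reading can be checked directly. The sketch below computes perplexity from the probability a model assigned to each correct token. The probabilities are invented for illustration.

```python
import math

def perplexity(token_probs):
    # Average negative log-likelihood per token (cross-entropy in nats)
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

# A model that spreads probability uniformly over 10 choices
uniform = perplexity([0.1] * 5)

# A more confident model that gives the correct token higher probability
confident = perplexity([0.9, 0.8, 0.95, 0.7, 0.85])

print(f"Uniform over 10 choices: {uniform:.2f}")  # 10.00
print(f"Confident model: {confident:.2f}")
```

The uniform case lands on 10, and the confident model scores much closer to 1, the lowest value perplexity can take.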

Weight Decay

Weight decay is a regularization technique that prevents overfitting by penalizing large weights. It adds a small penalty to the loss function based on the magnitude of the model's weights.

This helps the model generalize better to new data by keeping weights small and preventing the model from memorizing the training data.

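    # Example: Using Weight Decay in PyTorch
    import torch.nn as nn
    import torch.optim as optim

    model = nn.Linear(10, 2)  # Placeholder model for illustration

    # Create optimizer with weight decay (L2 regularization)
    optimizer = optim.Adam(
        model.parameters(),
        lr=0.001,
        weight_decay=0.01  # L2 penalty coefficient
    )

    # With Adam, weight_decay adds 0.01 * weight to each gradient, which is
    # equivalent to: loss = original_loss + (0.01 / 2) * sum(weights^2)
    # This encourages smaller weights, reducing overfitting.
    # (optim.AdamW instead applies the decay directly to the weights.)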
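To see what the penalty does to the weights themselves, here is a minimal sketch of one plain gradient-descent step with and without weight decay. It uses simple SGD rather than Adam to keep the arithmetic visible. The `sgd_step` helper and all numbers are assumptions made for this example.

```python
# One plain gradient-descent step with and without weight decay.
def sgd_step(weights, grads, lr, weight_decay=0.0):
    # Weight decay adds weight_decay * w to each gradient before the update
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

weights = [4.0, -2.0, 0.5]
grads = [0.0, 0.0, 0.0]  # zero task gradient isolates the decay effect

no_decay = sgd_step(weights, grads, lr=0.1)
with_decay = sgd_step(weights, grads, lr=0.1, weight_decay=0.01)

print(no_decay)    # weights unchanged
print(with_decay)  # each weight scaled by (1 - 0.1 * 0.01) = 0.999
```

Every step pulls each weight a little toward zero, in proportion to its size, so large weights survive only if the task gradient keeps pushing them back.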

Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that AI models can understand. Tokens can be words, subwords, or even characters depending on the tokenizer.

Modern language models use subword tokenization (like BPE or WordPiece) to handle unknown words and reduce vocabulary size.

    # Example: Tokenization with Hugging Face Transformers
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    # Tokenize text
    text = "Hello, how are you?"
    tokens = tokenizer.encode(text)
    print(f"Tokens: {tokens}")
    # Output: Tokens: [15496, 11, 703, 389, 345, 30]

    # Decode back to text
    decoded = tokenizer.decode(tokens)
    print(f"Decoded: {decoded}")
    # Output: Decoded: Hello, how are you?
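How subword tokenization handles words it has never seen whole can be sketched with a toy greedy longest-match tokenizer in the WordPiece style. The vocabulary below is invented for this example. Real tokenizers learn theirs from data, and BPE (used by GPT-2) builds tokens through learned merges rather than this lookup.

```python
# Toy vocabulary, made up for this example. "##" marks a piece that
# continues a word, following the WordPiece convention.
VOCAB = {"token", "##ization", "##ize", "##s", "un", "##known", "[UNK]"}

def wordpiece_tokenize(word, vocab=VOCAB):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_tokenize("tokenization"))  # ['token', '##ization']
print(wordpiece_tokenize("tokens"))        # ['token', '##s']
print(wordpiece_tokenize("unknown"))       # ['un', '##known']
print(wordpiece_tokenize("xyz"))           # ['[UNK]']
```

Seven vocabulary entries cover three different words here, which is the vocabulary-size saving that subword schemes aim for.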

Local Inference

Local inference means running AI models on your own computer instead of sending data to cloud servers. Your data never leaves your machine, there is no network round-trip latency, and inference keeps working without an internet connection.

With OktoSeek IDE, you can train and run models entirely on your local machine, keeping your data private and secure.

    # Example: Running Local Inference with PyTorch
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Load model locally (no internet needed after download)
    model = GPT2LMHeadModel.from_pretrained('./local_model')
    tokenizer = GPT2Tokenizer.from_pretrained('./local_model')

    # Run inference on your machine
    input_text = "The future of AI is"
    inputs = tokenizer(input_text, return_tensors='pt')

    # Generate locally (no cloud API calls)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=50)

    result = tokenizer.decode(outputs[0])
    print(result)
    # All processing happens on your computer!

QLoRA (Quantized Low-Rank Adaptation)

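QLoRA is an efficient fine-tuning technique that allows you to train large language models on consumer hardware. It stores the frozen base model in 4-bit precision and trains only small Low-Rank Adaptation (LoRA) layers on top of it, which reduces the memory needed for the model weights by up to 75% compared with 16-bit precision.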

This makes it possible to fine-tune models like LLaMA on a single GPU with 16GB of VRAM.

    # Example: QLoRA Fine-tuning Setup
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )

    # Load model with quantization
    # (the "-hf" repository holds the Transformers-format weights)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        quantization_config=bnb_config
    )

    # Add LoRA adapters (only train small adapter layers)
    lora_config = LoraConfig(
        r=16,  # Rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    # Now you can fine-tune on a single GPU!
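The savings come from two places, and rough arithmetic shows both. The sketch below assumes a hidden size of 4096 and 7 billion parameters as round figures for a 7B-class model. It ignores quantization constants, activations, and optimizer state, so real memory use will be higher.

```python
def lora_trainable_params(d_in, d_out, rank):
    # LoRA learns the update to a d_out x d_in weight matrix as the product
    # of two small matrices: B (d_out x rank) and A (rank x d_in)
    return rank * (d_in + d_out)

d_in = d_out = 4096  # assumed hidden size
rank = 16

full = d_in * d_out
lora = lora_trainable_params(d_in, d_out, rank)
print(f"Full matrix: {full:,} parameters")       # 16,777,216
print(f"LoRA adapter: {lora:,} parameters")      # 131,072
print(f"Trainable fraction: {lora / full:.2%}")  # 0.78%

# Weight storage: 16 bits vs 4 bits per parameter
params = 7_000_000_000  # assumed parameter count
gb_16bit = params * 16 / 8 / 1e9
gb_4bit = params * 4 / 8 / 1e9
print(f"16-bit: {gb_16bit:.1f} GB, 4-bit: {gb_4bit:.1f} GB")  # 14.0 GB, 3.5 GB
```

Going from 14.0 GB to 3.5 GB is the 75% reduction in weight storage, and the adapter adds under 1% trainable parameters per adapted matrix.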

ONNX (Open Neural Network Exchange)

ONNX is an open format for representing machine learning models. It allows you to export models trained in one framework (like PyTorch) and run them in a different framework or runtime (such as ONNX Runtime, TensorRT, or a mobile inference engine).

OktoSeek IDE can export your trained models to ONNX format for deployment across different platforms.

    # Example: Exporting PyTorch Model to ONNX
    import torch
    import torch.nn as nn

    # Your trained model (a placeholder layer is used here for illustration)
    model = nn.Linear(128, 10)
    model.eval()

    # Dummy input (example shape)
    dummy_input = torch.randn(1, 128)

    # Export to ONNX
    torch.onnx.export(
        model,
        dummy_input,
        "model.onnx",
        input_names=['input'],
        output_names=['output'],
        dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
    )
    print("Model exported to model.onnx")
    # Now you can use this ONNX file in any ONNX-compatible runtime!

Model Export in OktoSeek IDE

When you train a model in OktoSeek IDE, it's automatically saved in the standard PyTorch/Hugging Face format. This format is compatible with the entire AI ecosystem and can be used directly with transformers, loaded in Python scripts, or converted to other formats as needed.

Default Export Format: The IDE exports models in the standard directory structure used by Hugging Face Transformers, including:

  • config.json - Model configuration
  • pytorch_model.bin or model.safetensors - Model weights
  • tokenizer.json and related files - Tokenizer configuration
  • training_args.bin - Training arguments used

This format ensures maximum compatibility and allows you to use your trained models anywhere PyTorch or Hugging Face Transformers is supported.

    # Example: Loading a model exported from OktoSeek IDE
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load your trained model (exported from IDE)
    model_path = "./okto_ai/models/my_trained_model"
    model = AutoModelForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Use your model immediately
    input_text = "Your prompt here"
    inputs = tokenizer(input_text, return_tensors='pt')
    outputs = model.generate(**inputs, max_length=100)
    result = tokenizer.decode(outputs[0])
    print(result)
    # Your model is ready to use!
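Before loading, it can help to confirm that an export directory contains the files listed above. The following is a minimal sketch using only the standard library. The `check_export_dir` helper is hypothetical and not part of OktoSeek IDE or Transformers, and it checks only for the configuration and weight files.

```python
from pathlib import Path

# File names taken from the list above; weights may be in either format
REQUIRED = ["config.json"]
WEIGHT_FILES = ["model.safetensors", "pytorch_model.bin"]

def check_export_dir(model_dir):
    """Return a list of problems found in an exported model directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{model_dir} is not a directory"]
    problems = []
    for name in REQUIRED:
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing model weights")
    return problems

# An empty list means the basic files are in place
print(check_export_dir("./okto_ai/models/my_trained_model"))
```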

Note: While OktoSeek IDE exports in the standard PyTorch format, you can convert your models to ONNX or TFLite using external tools if needed for specific deployment scenarios (mobile apps, edge devices, etc.).

TFLite (TensorFlow Lite) - Optional Conversion

TensorFlow Lite is a lightweight solution for deploying machine learning models on mobile and edge devices. While OktoSeek IDE exports models in PyTorch format by default, you can convert them to TFLite for mobile deployment.

TFLite models are smaller and faster than full TensorFlow models, perfect for mobile apps and edge computing.

    # Example: Converting TensorFlow Model to TFLite
    # Note: First convert PyTorch to TensorFlow, then to TFLite
    import tensorflow as tf

    # Load your saved model
    model = tf.keras.models.load_model('my_model.h5')

    # Convert to TFLite
    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Optional: Optimize for size
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Convert
    tflite_model = converter.convert()

    # Save TFLite model
    with open('model.tflite', 'wb') as f:
        f.write(tflite_model)

    print("Model converted to TFLite!")
    # Now deploy to Android/iOS apps!

Creating Datasets from Videos

OktoStudio can extract training data from videos by analyzing frames and generating narrative descriptions. This is perfect for training models that understand visual storytelling or need to learn from video content.

Use Case: Upload a short film. The IDE extracts each scene, describes what's happening second-by-second, and creates training data that teaches your model to understand visual narratives.

    # Example: Video Dataset Creation Process
    # 1. Upload video file
    video_path = "short_story.mp4"

    # 2. IDE extracts frames and generates descriptions:
    #    Frame 0:00 - "A person walks into a room"
    #    Frame 0:01 - "They approach a window"
    #    Frame 0:02 - "Looking outside at the city"
    #    ... and so on

    # 3. Creates training dataset:
    training_data = [
        {"timestamp": "0:00", "description": "A person walks into a room"},
        {"timestamp": "0:01", "description": "They approach a window"},
        # ...
    ]

    # 4. Ready to train your model!
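The last step, turning per-second descriptions into records, can be sketched as follows. The descriptions are hard-coded stand-ins for whatever the captioning step produces, and the JSON Lines output is an assumption for illustration. The format OktoStudio actually writes is not specified here.

```python
import json

def format_timestamp(seconds):
    # Convert an offset in whole seconds to the "M:SS" form used above
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"

# Hypothetical per-second descriptions from a captioning step
descriptions = [
    "A person walks into a room",
    "They approach a window",
    "Looking outside at the city",
]

training_data = [
    {"timestamp": format_timestamp(i), "description": text}
    for i, text in enumerate(descriptions)
]

# Serialize as JSON Lines: one record per line
jsonl = "\n".join(json.dumps(record) for record in training_data)
print(jsonl)
```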

How OktoStudio Monitors Training

The IDE continuously monitors your training process, tracking loss, learning rate, gradient norms, and more. Our intelligent Agents analyze this data in real-time and provide recommendations to improve training.

Real-time Monitoring: Every step, the IDE collects metrics and updates graphs. The Agents analyze trends and detect issues like overfitting before they become problems.

    # Example: How Agents Help During Training
    # The IDE monitors these metrics every step (illustrative values):
    metrics = {
        "loss": 2.31,
        "learning_rate": 0.0002,
        "gradient_norm": 0.87,
        "weight_norm": 45.2,
        "perplexity": 10.07  # e^loss
    }

    # Flags produced by the Agents' trend analysis (set by hand here):
    loss_not_decreasing = False
    overfitting_detected = True
    agent_suggestion = None

    # Agents analyze trends:
    if loss_not_decreasing:
        agent_suggestion = "Loss plateau detected. Consider adjusting learning rate."
    if overfitting_detected:
        agent_suggestion = "Overfitting detected. Suggest adding training data or regularization."
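One of the monitored metrics, the gradient norm, is usually the L2 norm taken over all gradient values at once. The sketch below computes it in plain Python and shows the standard clip-by-global-norm remedy for oversized gradients. The helper names and gradient values are assumptions for illustration, and how OktoStudio computes its own metric is not documented here.

```python
import math

def global_grad_norm(grads):
    # L2 norm over every gradient value in every parameter tensor
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients by one factor so their global norm is max_norm
    norm = global_grad_norm(grads)
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads]

# Two made-up parameter tensors, flattened to lists
grads = [[3.0, 0.0], [0.0, 4.0]]
print(global_grad_norm(grads))  # 5.0

clipped = clip_by_global_norm(grads, max_norm=1.0)
print(global_grad_norm(clipped))  # approximately 1.0
```

Scaling every gradient by the same factor keeps the update direction intact and only shortens the step.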

Dataset Merging and Percentage Control

When you select multiple datasets, OktoStudio automatically merges them. You can specify what percentage of training data comes from each source. This is crucial for balanced training.

Example: Training a multilingual model? Use 40% English, 30% Spanish, 20% French, and 10% other languages. The IDE ensures each language gets proper representation during training.

    # Example: Multi-dataset Training with Percentages
    # In OktoSeek IDE, you configure:
    dataset_config = {
        "english_corpus.txt": 40,   # 40% of training data
        "spanish_corpus.txt": 30,   # 30% of training data
        "french_corpus.txt": 20,    # 20% of training data
        "other_languages.txt": 10   # 10% of training data
    }

    # IDE automatically:
    # - Loads all datasets
    # - Calculates total size
    # - Samples from each according to percentages
    # - Creates balanced training batches
    # - Ensures proper distribution across epochs
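One common way to honor such percentages is weighted sampling: pick the source of each training example in proportion to its share. The sketch below is one possible approach under that assumption, not OktoStudio's actual implementation. The tiny in-memory lists stand in for the corpus files, and it samples with replacement.

```python
import random
from collections import Counter

def sample_mixed_batch(datasets, percentages, batch_size, rng):
    # Pick a source for each slot in proportion to its percentage,
    # then draw a random example from that source
    names = list(datasets)
    weights = [percentages[name] for name in names]
    sources = rng.choices(names, weights=weights, k=batch_size)
    return [(src, rng.choice(datasets[src])) for src in sources]

# Tiny in-memory stand-ins for the corpus files
datasets = {
    "english": ["hello", "good morning"],
    "spanish": ["hola", "buenos días"],
    "french": ["bonjour", "bonsoir"],
    "other": ["ciao", "hallo"],
}
percentages = {"english": 40, "spanish": 30, "french": 20, "other": 10}

rng = random.Random(0)  # fixed seed so the run is repeatable
batch = sample_mixed_batch(datasets, percentages, batch_size=10_000, rng=rng)

counts = Counter(src for src, _ in batch)
for name in percentages:
    print(f"{name}: {counts[name] / len(batch):.1%}")
```

Over a large number of draws the observed shares settle close to 40/30/20/10, regardless of how many examples each source actually contains.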

Training Speed: GPU vs CPU

Training speed depends heavily on your hardware. GPUs (Graphics Processing Units) are designed for parallel computation, making them ideal for AI training. CPUs can train models too, but much more slowly.

GPU Training: With an NVIDIA GPU (RTX 3060 or better), training can be 10-50x faster. A model that takes 10 hours on CPU might take 12-60 minutes on GPU.

CPU Training: Still works! OktoStudio automatically detects your hardware and optimizes. If no GPU is available, training runs on CPU with appropriate batch sizes and optimizations.

    # Example: Hardware Detection in OktoSeek IDE
    import torch

    # IDE automatically detects:
    if torch.cuda.is_available():
        device = "cuda"
        gpu_name = torch.cuda.get_device_name(0)
        print(f"Training on GPU: {gpu_name}")
        # Optimize batch size for GPU
        batch_size = 8  # Larger batches on GPU
    else:
        device = "cpu"
        print("Training on CPU")
        # Optimize batch size for CPU
        batch_size = 2  # Smaller batches on CPU

    # Training time estimate adjusts automatically
    # GPU: ~20 minutes | CPU: ~10 hours (for same model)
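The time estimates follow from dividing the CPU time by the speedup factor. A short sketch of that arithmetic, with a hypothetical helper name, using the 10-hour CPU run as the baseline:

```python
def estimate_gpu_minutes(cpu_hours, speedup):
    # Training time scales inversely with the speedup factor
    return cpu_hours * 60 / speedup

cpu_hours = 10
for speedup in (10, 30, 50):
    minutes = estimate_gpu_minutes(cpu_hours, speedup)
    print(f"{speedup}x speedup: {minutes:.0f} minutes")
# 10x speedup: 60 minutes
# 30x speedup: 20 minutes
# 50x speedup: 12 minutes
```

The ~20 minute figure in the example above corresponds to a speedup of about 30x.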

How Agents Assist During Training

OktoSeek's intelligent Agents monitor your training in real-time. They analyze loss curves, detect overfitting, and suggest adjustments—all without revealing proprietary algorithms.

Overfitting Prevention: Agents track validation loss vs training loss. If training loss keeps decreasing while validation loss increases, the model is starting to memorize the training data, and the Agents suggest early stopping, stronger regularization, or a lower learning rate.

Real-time Adjustments: Based on gradient norms and weight magnitudes, Agents can suggest when to stop training, when to adjust parameters, or when to change the learning rate schedule.

    # Example: Agent Decision-Making Process
    # (Simplified - actual implementation is proprietary)
    def agent_analyze_training(train_losses, val_losses, gradient_norm, threshold=10.0):
        # Check for a plateau: compare the mean of the last 100 losses
        # with the mean of the 100 losses before them
        recent = train_losses[-100:]
        previous = train_losses[-200:-100]
        if previous and sum(recent) / len(recent) >= sum(previous) / len(previous):
            return "Loss not decreasing. Consider adjusting learning rate."

        # Check gradient norms
        if gradient_norm > threshold:
            return "Gradient explosion detected. Suggest gradient clipping."

        # Check for overfitting: validation loss rising while training loss falls
        if (len(val_losses) >= 2 and val_losses[-1] > val_losses[-2]
                and train_losses[-1] < train_losses[-2]):
            return "Overfitting detected. Suggest early stopping or regularization."

        return "Training progressing well!"
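Early stopping, one of the suggestions named above, is commonly implemented with a "patience" counter: training stops once validation loss has failed to improve for a set number of epochs. The sketch below shows that generic technique. It is not the Agents' proprietary logic, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses: improving, then rising (a sign of overfitting)
val_losses = [2.1, 1.8, 1.6, 1.65, 1.7, 1.9]
print(early_stopping_epoch(val_losses))  # 4
```

The best loss (1.6) occurs at epoch 2. Epochs 3 and 4 both fail to beat it, so with a patience of 2 training stops at epoch 4, and the checkpoint from epoch 2 would be the one to keep.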

OktoScript DSL: Declarative AI Configuration

OktoScript is a Domain-Specific Language (DSL) created by OktoSeek AI for defining AI training pipelines in a declarative, readable format. Instead of writing complex Python scripts, you describe what you want in structured blocks.

What is OktoScript?

OktoScript is not a general-purpose programming language—it's a declarative DSL designed specifically for AI pipelines. Think of it like Docker Compose for containers or SQL for databases—a specialized language for a specific domain.

With OktoScript, you can define complete AI training workflows including datasets, models, training parameters, fine-tuning (LoRA), monitoring, and export formats—all in a simple, version-control-friendly format.

Key Features

  • Declarative Syntax: Describe what you want, not how to do it
  • No Python Required: Complete pipelines without Python boilerplate
  • Version Control: Easy to read, diff, and review in Git
  • IDE Integration: Full syntax highlighting and validation in OktoSeek IDE
  • Reproducible: Every training run is exactly defined and repeatable
  • Dataset Mixing: Combine multiple datasets with weighted sampling (v1.1+)
  • LoRA Fine-tuning: Efficient adapter-based training (v1.1+)
  • Advanced Monitoring: System telemetry and training metrics

Example: Complete Training Pipeline

    # okto_version: "1.1"
    PROJECT "MyModel"

    DATASET {
        mix_datasets: [
            { path: "dataset/base.jsonl", weight: 70 },
            { path: "dataset/extra.jsonl", weight: 30 }
        ]
        format: "jsonl"
        type: "chat"
    }

    MODEL {
        base: "oktoseek/base-llm-7b"
        architecture: "transformer"
    }

    FT_LORA {
        base_model: "oktoseek/base-llm-7b"
        lora_rank: 8
        lora_alpha: 32
        epochs: 5
        device: "cuda"
    }

    MONITOR {
        level: "full"
        log_metrics: ["loss", "accuracy"]
        dashboard: true
    }

    EXPORT {
        format: ["okm", "onnx"]
        path: "export/"
    }

How OktoScript Relates to IDE Concepts

Everything you can do in the OktoSeek IDE visual interface can also be defined in OktoScript. The IDE automatically generates OktoScript from your visual configurations, and you can edit it directly for advanced control.

  • Dataset Creation: The DATASET block defines your training data, including mixing multiple datasets
  • Model Configuration: The MODEL block specifies your base model and architecture
  • Training Parameters: The TRAIN block controls epochs, batch size, learning rate, etc.
  • LoRA Fine-tuning: The FT_LORA block enables efficient adapter-based training
  • Monitoring: The MONITOR block configures system and training telemetry
  • Export Formats: The EXPORT block specifies output formats (ONNX, GGUF, OktoModel, etc.)

Why Use OktoScript?

While the visual IDE is perfect for beginners and quick prototyping, OktoScript is ideal for:

  • Advanced Users: Fine-grained control over every aspect of training
  • Reproducible Research: Version-controlled, exact configurations
  • Automation: Script-based workflows and CI/CD integration
  • Collaboration: Share and review configurations in Git
  • Complex Pipelines: Multi-dataset mixing, advanced monitoring, custom hooks
