LLM Architecture & Model Compression Research

Project Overview

An ongoing experimental project exploring large language models through hands-on fine-tuning, compression, and deployment - four threads running in parallel:

Capacity-threshold study across the Qwen 0.5B - 7B sweep on low-resource Indic languages, with 72B teacher-student distillation pulling clean training data out of noisy corpora. Headline finding: 0.5B degrades to syntactic mimicry on noisy data; 7B (14× the parameters) unlocks semantic instruction-following on the same data.
On-device deployment of Gemma 3 1B on the 4GB NVIDIA Jetson Nano via 8-bit quantization plus LoRA - a real-world stress test of the accuracy / latency / memory frontier.
Quantization & model compression subproject pushing that frontier further (PTQ, QAT, AWQ, GPTQ, pruning, 2:4 structured sparsity) across the 0.5B - 7B sweep and across Jetson Nano, Jetson Orin, and consumer GPUs.
Diffusion-based language models asking what a non-autoregressive generative paradigm buys on math reasoning, coding capability, and model alignment via the denoising objective.

Indic text generation sits downstream of all four threads - a model that can reason, code, compress, and stay aligned is a far better candidate for low-resource languages than one that just memorises tokens.


My work involves contrasting models across a wide parameter spectrum (0.5B → 7B), systematically analyzing the effects of architectural constraints, quantization, and model compression techniques. Rather than treating compression purely as an optimization step after training, the project uses it as an analytical tool to study information bottlenecks, representation degradation, and architectural redundancy within language models.

Experimental Focus Areas:

Fine-tuning models across the parameter spectrum (0.5B → 7B) for low-resource languages
Testing quantization and compression techniques (QLoRA, 4-bit/8-bit precision)
Edge deployment feasibility on constrained hardware (Jetson Nano, limited VRAM)
Understanding capacity thresholds for instruction-following vs. syntactic mimicry
Evaluating trade-offs between model size, latency, and semantic competence

Motivation

Current LLMs range from under 1B to over 400B parameters, with larger models typically performing better but requiring significant computational resources. Through hands-on experimentation, I'm exploring the practical limits of compression and edge deployment, targeting settings such as:

Edge devices and mobile applications
Real-time inference systems with latency constraints
Organizations with limited computational budgets
Privacy-critical applications requiring on-device inference

The goal of these experiments is to understand what is actually feasible when deploying language models in resource-constrained environments, from mobile devices to edge hardware.

Experimental Directions

Quantization & Edge Deployment

Reducing model precision (32-bit → 8-bit → 4-bit) via QLoRA and quantization-aware training. Testing deployment on constrained edge devices with minimal performance degradation.
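As a rough guide to what each precision step buys, weight storage scales linearly with bits per parameter. The sketch below is back-of-the-envelope arithmetic only: it counts raw weights at nominal parameter counts and ignores activations, KV cache, quantization scales, and optimizer state. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Raw weight storage in GB (10**9 bytes) at a given bit width."""
    return n_params * bits / 8 / 1e9


# Nominal parameter counts for the model sizes discussed in this project.
MODELS = {"0.5B": 0.5e9, "1B": 1e9, "7B": 7e9, "72B": 72e9}

for name, n_params in MODELS.items():
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(n_params, bits):6.2f} GB"
        for bits in (32, 16, 8, 4)
    )
    print(f"{name:>4} -> {row}")
```

At nominal sizes, a 7B model drops from 28 GB at 32-bit to 3.5 GB at 4-bit, while a 72B model still needs about 36 GB at 4-bit before any runtime overhead.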

Teacher-Student Distillation

Using large teacher models (Qwen 72B) to generate high-quality instruction data, enabling smaller models to learn semantic instruction-following behaviors through supervised fine-tuning.

Multilingual Fine-Tuning

Investigating SFT effectiveness across low-resource languages (Hindi, Kannada) to understand parameter thresholds for semantic grounding in non-English contexts.

Sparse Layer Adaptation

LoRA and parameter-efficient fine-tuning to enable training on constrained hardware (cloud GPUs and edge devices) while maintaining model quality and reducing memory footprint.

Experiment Indic-V1: Architectural Trade-Offs in Fine-Tuning Qwen for Low-Resource Languages

January 2026 · Experimental Project
Focus: Multilingual Fine-Tuning, Capacity Thresholds, and Low-Resource Deployment

Experiment Hypothesis

Deploying Large Language Models on edge devices requires aggressive architectural constraints, often at the cost of representational density. This experiment was driven by a specific hypothesis: Can a highly constrained, sub-1-billion parameter model (Qwen 0.5B) successfully learn Indic instruction-following behaviors via Supervised Fine-Tuning (SFT), or is there a hard parameter threshold required for multilingual semantic grounding?

To test this, I designed a controlled experiment using:

Data Control: A unified, high-quality synthetic dataset generated via teacher-student distillation
Architecture Variable: Contrasting a "Toy" architecture (0.5B) against a "Production" architecture (7B)
Compute Constraint: Training on a single GPU node

Key Finding: Model capacity acts as a hard boundary for semantic instruction following. The Qwen 0.5B model exhibited catastrophic forgetting and mode collapse (Syntactic Mimicry), while the Qwen 7B model demonstrated emergent intelligence through accurate factual retrieval and linguistic alignment, despite being trained on identical noisy data.

1. The Data Bottleneck: Taming 95GB of Raw Corpora

This experiment required high-quality instruction-response pairs for Hindi and Kannada. I sourced the raw text from the AI4Bharat IndicCorp v2 dataset, but it was massive and unstructured:

Kannada: ~20GB of raw text (kn.txt)
Hindi: ~75GB of raw text split across multiple files (hi_1.txt, hi_2.txt, hi_3.txt)

Manual cleaning was infeasible at this scale. Initial attempts to train directly on raw subsets resulted in simple text continuation rather than instruction following: the models merely predicted the next word instead of answering questions.

The Teacher-Student Distillation Pipeline

To bypass manual curation, I deployed the massive Qwen 2.5 72B Instruct model as a "Teacher" to act as a semantic filter and transmute raw text into structured, high-quality instruction-response pairs.

Pipeline Architecture:

Sampling: Extracted 6,000 lines from raw text files for both Hindi and Kannada
Structuring: The 72B model converted raw text into standard Alpaca-style JSONL format
Constraint: Characters per paragraph capped at 5,000 to ensure concise context usage
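The three pipeline steps can be sketched as follows. This is a minimal illustration, not the project's actual script: ask_teacher stands in for the call to the 72B model, and the function names, the fixed random seed, and the choice to truncate (rather than skip) over-long paragraphs are all assumptions.

```python
import json
import random
from typing import Callable, Iterable

MAX_CHARS = 5_000  # per-paragraph cap from the pipeline description


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Sampling: pick up to k non-empty lines from a raw corpus."""
    pool = [ln.strip() for ln in lines if ln.strip()]
    return random.Random(seed).sample(pool, min(k, len(pool)))


def to_alpaca_record(paragraph: str, ask_teacher: Callable[[str], dict]) -> dict:
    """Structuring: turn one raw paragraph into an Alpaca-style record."""
    capped = paragraph[:MAX_CHARS]  # Constraint: truncation is an assumption
    pair = ask_teacher(capped)      # hypothetical call to the teacher model
    return {"instruction": pair["instruction"], "input": "", "output": pair["output"]}


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialise one JSON object per line, keeping Indic scripts readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Stub teacher for demonstration only; it just echoes its input.
def fake_teacher(text: str) -> dict:
    return {"instruction": "What does this passage say?", "output": text}


raw = ["first raw line", "", "second raw line", "x" * 6_000]
records = [to_alpaca_record(p, fake_teacher) for p in sample_lines(raw, k=3)]
print(to_jsonl(records)[:200])
```

Passing the teacher in as a callable keeps the formatting logic testable without loading a 72B model; ensure_ascii=False keeps Devanagari and Kannada text human-readable in the JSONL output.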

Hardware & Implementation Details

Due to massive VRAM requirements, the 72B model could not be loaded natively:

Compute: a single GPU node
Sharding: 72B parameter model broken into 37 shards to fit into memory
Batch Sizing: Initial 24 parallel generations crashed the system; throttled to batch size 16
Context Windows: Token limit increased from 64/256 to 512 tokens mid-process to handle agglutinative Kannada structures without truncation
Processing Time: 15-19 hours for Kannada subset alone
Output: 8,366 high-quality instruction pairs (≈4,500 Kannada, ≈3,800 Hindi)

Sample Generated Data:

KANNADA (GOVERNMENT & INFRASTRUCTURE)
{
 "instruction": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ನಡೆದ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭ ಯೋಜನೆಯ ಬೆಲೆ ಎಷ್ಟು?",
 "input": "",
 "output": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ಭಾನುವಾರ 1,650 ಕೋಟಿ ರೂ. ವೆಚ್ಚದ ಹಲವು ಅಭಿವೃದ್ಧಿ ಯೋಜನೆಗಳ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭದಲ್ಲಿ ಮಾತನಾಡಿದರು."
}

HINDI (AUTOMOTIVE)
{
 "instruction": "मारुति बलेनो पर बड़ा डिस्काउंट किस उद्देश्य से लाया गया है?",
 "input": "",
 "output": "इसके साथ ही कई जगहों पर मारुति बलेनो में कई तरह के डिस्काउंट व ऑफर भी उपलब्ध कराये जा रहे है। माना जा रहा है कि यह बड़ा डिस्काउंट इसके स्टॉक को जल्द खत्म करने के लिए लाया गया है।"
}

HINDI (ECONOMICS)
{
 "instruction": "चीन की अर्थव्यवस्था आगामी वर्ष में बांडों की अदायगी में डिफॉल्ट की समस्या का सामना कैसे करेगी?",
 "input": "",
 "output": "यह समस्या जितनी नजर आ रही है उससे भी कहीं अधिक गंभीर है क्योंकि चीन की 134 अरब डॉलर की बांड परिपक्वता को देखते हुए आगामी वर्ष में इसकी अदायगी में और भी अधिक डिफॉल्ट हो सकते हैं।"
}

2. Phase I: The Capacity Bottleneck (Qwen 0.5B)

The initial fine-tuning phase utilized the Qwen 2.5 0.5B Instruct model to test whether a highly constrained architecture could learn instruction-following behavior.

Debugging the Missing Labels Bug

Training initially crashed with ValueError: The model did not return a loss. The dataset was tokenized successfully but lacked a labels column, meaning the model received input sequences but had no "correct answer" against which to calculate error.

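Solution: Explicitly copy input_ids to labels during the data mapping phase:

def process_data(sample):
    # format_instruction and tokenizer are assumed to be defined earlier in the script.
    text = format_instruction(sample)
    tokenized = tokenizer(text, truncation=True, max_length=512)
    # Causal-LM fine-tuning: the labels are the input tokens themselves;
    # the model shifts them internally to score next-token predictions.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Drop the raw text columns so only the tokenized fields reach the trainer.
dataset = dataset.map(process_data, remove_columns=dataset.column_names)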

Training Dynamics & Mode Collapse

Once fixed, training on the A100 completed in ~20 minutes at ~1.6 seconds per step. However, inference revealed catastrophic forgetting and mode collapse. The loss curve appeared healthy, but the model had learned to be a sophisticated noise repeater rather than an instruction follower.

What catastrophic forgetting meant here. The 0.5B model was fine-tuned with QLoRA (a 4-bit quantized base with trainable low-rank adapters) on the raw multilingual news corpus. Fine-tuning pulls the weights toward the new objective - next-token prediction over noisy news text - and in doing so it overwrote the instruction-following behaviour the pre-trained Instruct model already had. The model did not add a skill; it traded one away. Ask it a question and it answered with a fabricated news snippet, because predicting news text was now the only thing it knew how to do.

The cause is capacity. A 0.5B network has almost no spare representational budget: the corpus-fitting signal and the pre-trained instruction-following circuit compete for the same parameters, and the corpus signal won. QLoRA normally guards against this - it freezes the base model and trains only small adapters - but the base here is so small that the adapter's shift dominated the output distribution at inference, and the model collapsed into surface-level mimicry of the training text. The trap was the loss curve: it looked healthy because the model was learning - just the wrong thing. Training loss measured how well it fit the news corpus, not whether it could still follow an instruction. A clean loss curve and a forgotten task can coexist, and that is exactly what happened here.

Test Query	0.5B Response	Diagnosis
"Who are you?"	"Said 27: The police arrested..."	Syntactic Mimicry: Regurgitated reporting patterns instead of grasping semantic identity
"India Capital?"	"Modi visited the capital on..."	Associative Hallucination: Triggered statistical association between "Capital" and "Politics" instead of factual retrieval
"Karnataka CM?"	"Karnataka, also known as the 'Red Hills', is one of India's most populous states..."	Empty Head Problem: Lacking parameters for world knowledge, hallucinated nonsensical monikers to fill void

Detailed Failure Case: Identity Loss (Hindi)

User: नमस्ते, आप कौन हैं? (Hello, who are you?)
AI Output: आज सुबह 10:35 बाद में प्रधानमंत्री डा. रणवीर गोयल अभियान के लिए भगवान शिव की जड़ और उच्च इशारा के खिलाफ...

Analysis: The model generated statistically probable Hindi text (mimicking a news snippet about a campaign) but completely ignored the question. It hallucinated a "Dr. Ranveer Goyal" and a fabricated timestamp (10:35), reproducing the stylistic patterns of the raw training data (news articles, timestamps, official names).

The "Parrot" Phenomenon: With limited parameter space, the 0.5B model behaved like a linguistic echo chamber. It lacked the cognitive depth to understand the abstract concept of "Question & Answer." Instead, it focused on surface-level features (dates, colons, reporter syntax) and essentially memorized the noise in the dataset rather than learning the underlying task of instruction-following.

3. Phase II: Scaling to Emergent Intelligence (Qwen 7B)

Realizing that a 0.5B architecture lacked the requisite "neurons" for factual storage or semantic abstraction, I escalated the experiment to the Qwen 2.5 7B Instruct model, a 14x parameter increase.

QLoRA Configuration

The model was fine-tuned with Quantized Low-Rank Adaptation (QLoRA) to fit within the compute budget:

Base Model Quantization: 4-bit NF4 (NormalFloat)
LoRA Adapters: 16-bit (Rank 32, Alpha 64)
Trainable Parameters: 80,740,352 (~81 million, roughly 1% of the 7B base)
Training Time: 2 hours 30 minutes (~6.6 sec/step)
Final Loss: 0.61 (the "Goldilocks Zone", read here as learning without overfitting or memorization)
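The trainable-parameter figure can be sanity-checked by hand. A rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) weights. The sketch below assumes adapters on all seven linear projections of every transformer block and uses the published Qwen 2.5 7B dimensions. The choice of target modules is not stated above and is an assumption, though it does reproduce the reported count exactly.

```python
# Published Qwen 2.5 7B dimensions (assumed here, not taken from the write-up).
HIDDEN = 3584         # hidden size
KV_DIM = 512          # 4 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 18944  # MLP inner size
LAYERS = 28
RANK = 32

# (d_in, d_out) for each linear projection in one transformer block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in the A (rank x d_in) and B (d_out x rank) low-rank factors."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
# prints: per layer: 2,883,584  total: 80,740,352
```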

Inference Results: From Mimicry to Understanding

The 7B model demonstrated a qualitative leap in capability:

Sample 1: Fact Retrieval (Perfect Accuracy)

User: भारत की राजधानी क्या है? (What is the capital of India?)
AI: भारत की राजधानी दिल्ली है। (The capital of India is Delhi.)

Verdict: ✓ 100% Perfect. Concise and factually correct.

Sample 2: Hallucination vs. Structure (90% Success)

User: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಯಾರು? (Who is the CM of Karnataka?)
AI: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಸಿದ್ದರಾಮಯ್ಯ ಅವರು ಪ್ರಸ್ತುತವಾಗಿ ಬೆಳಗಾವಿಯಲ್ಲಿ ಉಪಸ್ಥಿತರು.

Analysis:
• ✓ Success: Correctly identified Siddaramaiah, proving it possesses the pre-trained world knowledge the 0.5B lacked
• ✗ Hallucination: The phrase "currently present in Belagavi" is a stylistic add-on. The model adopted a news-reporter persona from the training data, generating a live update rather than just answering.

Sample 3: Cultural Alignment & Persona Enforcement

User: Who are you? (in English)
AI: आप ने मुझे क्यों पूछा है? मैं आपकी सहायता करने के लिए यहाँ हूँ।
(Why did you ask me? I am here to help you.)

Verdict: ✓ Pass (with Attitude). The model prioritized the Indic-focused system prompt over the user's English input, answering in Hindi. Strong adherence to the injected persona.

The "Student" Phenomenon: At 7B parameters, the model exhibited emergent intelligence. It acted like a university student who could look at the same noisy training data, identify the underlying Q&A pattern, ignore the irrelevant syntactic fluff, and understand the intent of the instruction. Model capacity enables semantic abstraction rather than mere syntactic imitation.

Inference Architecture & System Prompt

To enforce language boundaries and persona constraints, a robust system prompt was designed:

SYSTEM_PROMPT = """You are a helpful, culturally aware AI assistant
designed to speak fluent Hindi and Kannada.
Your goal is to answer the user's questions accurately 
in the language they used (Hindi or Kannada).

GUIDELINES:
1. If the user asks in Hindi, reply in Hindi (Devanagari script).
2. If the user asks in Kannada, reply in Kannada (Kannada script).
3. Be concise and factual. Do not hallucinate news reports.
4. For technical topics, use English terms but explain in target language."""
 

Inference was configured with temperature 0.3 (favoring factual responses), top_p=0.9, and repetition_penalty=1.1 to prevent output looping.
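How these three knobs interact can be seen on a toy next-token distribution. The sketch below is a simplified, pure-Python illustration of a common order of operations (repetition penalty, then temperature, then nucleus filtering). It is not the code path of any particular inference library, and the function name and toy logits are made up for the example.

```python
import math


def filtered_distribution(logits, generated, temperature=0.3, top_p=0.9,
                          repetition_penalty=1.1):
    """Return {token_id: prob} after penalty, temperature, and top-p filtering."""
    logits = list(logits)
    # 1. Repetition penalty: make already-generated tokens less likely.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # 2. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-p: keep the smallest set of tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the surviving tokens.
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(filtered_distribution(toy_logits, generated=[0]))
```

With temperature 0.3 the low-scoring tokens are squeezed out before top-p is applied, and penalising a token that was already generated shifts some probability to the runner-up.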

4. Architectural Analysis & Key Insights

Insight 1: The "Empty Head" Problem-Capacity as a Hard Boundary

A critical finding: models cannot be fine-tuned to answer factual questions they never learned during pre-training. The 0.5B model's failure to identify the CM of Karnataka or accurately describe the state illustrates this limit. Asked about Karnataka, it invented a moniker (the "Red Hills") because it had no pre-trained knowledge to retrieve.

Why LoRA Works for 7B but Not 0.5B: Fine-tuning via LoRA is fundamentally about teaching a new style or language, not injecting new facts. The 7B model succeeded because it already possessed "Siddaramaiah" in its pre-trained weights; the LoRA adapter simply taught it to express that knowledge in Kannada. The 0.5B model had no knowledge to retrieve, so it hallucinated plausible-sounding nonsense instead.

Insight 2: Data Sensitivity & Noise Robustness Scale with Capacity

Noise Intolerance (0.5B): With limited parameter space, the 0.5B model was highly sensitive to data artifacts. Because ~5% of the training data resembled news reports (due to IndicCorp sourcing), the model overfitted to this pattern and became a "News Bot" that generated timestamps and reporter syntax.

Noise Filtering (7B): Despite training on identical noisy data, the 7B model demonstrated robustness by largely filtering out the reporter style. Its massive pre-trained signal for instruction-following outweighed the noise in the fine-tuning set. Hallucinations appeared ("currently present in Belagavi") but were minor stylistic add-ons rather than complete mode collapse.

Insight 3: Capacity Gates Semantic Abstraction

The difference between 0.5B and 7B is not just "more memory." It's a qualitative difference in abstraction capability:

0.5B: Operates at the syntactic level, learning surface patterns and correlations
7B: Operates at the semantic level, understanding intent, context, and factual grounding

This is consistent with the broader scaling literature, which reports capabilities that emerge only above a certain model size, even with the same training data.

Key Takeaway from Indic-V1: Model capacity acts as a hard threshold for semantic instruction following in multilingual contexts. A 14x parameter increase (0.5B → 7B) transforms a model from a syntactic mimic to a semantic reasoner. While the 0.5B model offers speed and efficiency, it sacrifices the representational density required for robust instruction-following behavior. This has direct implications for edge deployment: models must exceed a minimum capacity threshold to perform reliably on open-ended tasks, regardless of fine-tuning data quality.

Experiment Gemma 3 1B: On-Device Fine-Tuning & Inference on NVIDIA Jetson Nano

Experiment Motivation

While the Qwen 7B experiments demonstrated scalability on high-end hardware, the fundamental question remains: Can we fine-tune and deploy instruction-following models on ultra-constrained edge hardware? This experiment tests the extreme endpoint of model compression and edge deployment feasibility.

Hardware Constraints

Device: NVIDIA Jetson Nano Developer Kit
RAM: 4GB (system + GPU shared memory)
Storage: Limited internal storage; offloading to USB storage
Architecture: ARM-based (mobile/edge-optimized instruction set)
Compute: 128 NVIDIA CUDA cores

Model & Optimization Strategy

Model Choice: Google Gemma 3 1B (1 billion parameters, pre-trained on diverse multilingual data)

Optimization Pipeline:

Base Quantization: 8-bit (post-training quantization via qint8)
LoRA Adaptation: Rank 8, Alpha 16 (minimal memory footprint)
Batch Size: 1 (memory constraint requires sequential processing)
Gradient Accumulation: Used to simulate larger batch sizes without memory overhead
Max Sequence Length: 256 tokens (reduced from standard 512 due to 4GB RAM)
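Gradient accumulation with batch size 1 reproduces the gradient of a larger batch as long as each micro-batch contribution is divided by the number of accumulation steps. The toy below checks this on a one-parameter least-squares model in pure Python. The data and the accumulation length of 4 are illustrative, not the settings used on the Jetson.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]  # toy (x, y) pairs
w = 0.5
accum_steps = len(data)

# Reference: one step on the full batch of 4.
full_batch_grad = grad_mse(w, data)

# Accumulated: four micro-batches of size 1, each scaled by 1 / accum_steps.
accumulated = 0.0
for sample in data:
    accumulated += grad_mse(w, [sample]) / accum_steps

print(full_batch_grad, accumulated)
```

Only one sample's activations need to be held in memory at a time, which is what makes the technique useful on a 4GB device; the cost is that each optimizer step takes accum_steps forward and backward passes.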

Fine-Tuning Results

Training Performance:

Training Duration: ~4-6 hours per epoch on 2K instruction pairs
Throughput: ~0.3-0.5 samples/second (extremely slow but viable)
Memory Peak: 3.8GB / 4GB (nearly saturated but stable)
Convergence: Loss plateaued around 1.2 (higher than desktop training due to reduced capacity)

Key Achievement: Successfully trained a 1B parameter model on 4GB RAM without GPU Out-of-Memory errors. This shows that edge devices can perform model adaptation at deployment time, enabling personalization without cloud dependencies.

Inference on Jetson Nano

Inference Configuration:

Memory Usage: ~2.2GB (model weights + activations)
Inference Latency: Single token: ~200-300ms; Full response (64 tokens): ~15-20 seconds
Temperature: 0.5 (balancing coherence and diversity)
Max New Tokens: 64 (practical limit for interactive use)

Sample Inference Results:

Query 1: English Instruction

User: "Explain machine learning in simple terms."

Gemma 3 1B (Jetson Nano): "Machine learning is a type of artificial intelligence where computers learn patterns from data instead of being explicitly programmed. The model improves its performance as it sees more examples, similar to how humans learn through experience."

✓ Coherent, concise explanation; appropriate for edge device

Query 2: Multilingual Query (Hindi)

User: "भारत में सबसे बड़ा व्यावसायिक शहर कौन सा है?" (Which is the largest commercial city in India?)

Gemma 3 1B (Jetson Nano): "मुंबई भारत में सबसे बड़ा व्यावसायिक शहर है। यह देश का वित्तीय केंद्र भी माना जाता है।"

✓ Correct multilingual response; maintained context from fine-tuning

Performance Benchmarking

Metric	Value	Notes
Model Size	~2.4GB (8-bit quantized)	Reduced from 4GB full precision
Tokens/Second	~3-5 tok/s	Suitable for interactive assistants
End-to-End Latency (64 tok)	15-20 seconds	Acceptable for resource-constrained scenarios
Power Consumption	~5-7W (inference)	Battery-efficient for mobile/embedded use
Fine-Tuning Duration	4-6 hours/epoch	On-device adaptation possible
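The throughput and latency rows follow from the per-token latency reported above (~200-300 ms). A minimal check, assuming generation time is dominated by decoding and ignoring prompt processing (the helper name is hypothetical):

```python
def response_seconds(n_tokens: int, ms_per_token: float) -> float:
    """Decode-only latency estimate: tokens x per-token latency."""
    return n_tokens * ms_per_token / 1000


for ms in (200, 300):
    tok_per_s = 1000 / ms
    print(f"{ms} ms/token -> {tok_per_s:.1f} tok/s, "
          f"64 tokens in {response_seconds(64, ms):.1f} s")
# prints:
# 200 ms/token -> 5.0 tok/s, 64 tokens in 12.8 s
# 300 ms/token -> 3.3 tok/s, 64 tokens in 19.2 s
```

The decode-only estimate (12.8-19.2 s) sits slightly below the measured 15-20 s at the fast end; the difference may be prompt processing and other per-request overhead.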

Jetson Nano Deployment: What Worked

Edge Deployment Feasibility: A 1B parameter model can be fine-tuned, deployed, and served on 4GB Jetson Nano hardware using 8-bit quantization and LoRA. Inference latency (15-20 sec for 64 tokens) is acceptable for non-real-time applications (chatbots, document summarization, local RAG systems).

Practical Takeaway: Organizations can now deploy personalized LLMs on edge devices without cloud dependency, enabling privacy-preserving local inference for robotics, IoT, and embedded AI systems.

Active Subproject: Quantization & Model Compression Research

A second open thread under this project focuses specifically on quantization and model compression for production-grade edge deployment. The Jetson Nano work above proved that a 1B model can be fine-tuned and served on 4 GB of RAM with naive 8-bit quantization plus LoRA - the question this subproject asks is how much further the accuracy / latency / memory frontier can be pushed before the model breaks.

Methods under comparison on the Qwen and Gemma families:

Post-training quantization (PTQ) at 8-bit, 4-bit, and (where feasible) 3-bit weight precision, with per-channel and per-group scaling.
Quantization-aware training (QAT) - inserting fake-quant ops into fine-tuning to recover accuracy lost by PTQ at aggressive bit-widths.
AWQ and GPTQ - activation-aware and Hessian-based weight quantization for low-bit LLM-specific schemes.
Pruning and structured sparsity - magnitude-based unstructured pruning plus 2:4 structured sparsity for further footprint reduction.
Mixed-precision and weight-only schemes - keeping activations at higher precision while compressing weights, where the bottleneck is memory bandwidth rather than compute.
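Two of the methods in this list are simple enough to show in a few lines. The sketch below is a toy, pure-Python version of symmetric per-channel 8-bit PTQ and of 2:4 magnitude pruning on a tiny weight matrix. Real toolchains operate on tensors and calibrate on data; the function names and the example weights are illustrative only.

```python
def quantize_per_channel(weights, bits=8):
    """Symmetric PTQ: one scale per output channel (row)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    q_rows, scales = [], []
    for row in weights:
        # Fall back to scale 1.0 for an all-zero row to avoid dividing by zero.
        scale = max(abs(w) for w in row) / qmax or 1.0
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map integer codes back to approximate float weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def prune_2_of_4(row):
    """2:4 structured sparsity: keep the 2 largest-magnitude weights per group of 4."""
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


W = [[0.10, -0.52, 0.33, 0.04, 0.90, -0.01, 0.20, -0.70],
     [2.00, 0.05, -1.20, 0.40, 0.00, 0.30, -0.25, 0.10]]

q, s = quantize_per_channel(W)
W_hat = dequantize(q, s)
max_err = max(abs(a - b) for ra, rb in zip(W, W_hat) for a, b in zip(ra, rb))
print("max abs quantization error:", max_err)
print("2:4 pruned row 0:", prune_2_of_4(W[0]))
```

Per-channel scaling keeps the large weights in the second row from inflating the rounding error of the first; with round-to-nearest, the error per weight is bounded by half that row's scale.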

The target output is a clean accuracy / latency / memory Pareto frontier across the 0.5B - 7B Qwen / Gemma sweep, with explicit deployment recommendations per bit-width and per piece of hardware (Jetson Nano, Jetson Orin, consumer GPUs). This connects directly to the KD work at Erdős AI Lab: distillation tells you the minimum width below which quality collapses, quantization tells you the minimum precision below which quality collapses, and the deployment frontier is the joint of those two limits.
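A Pareto frontier over accuracy, latency, and memory can be extracted with a simple dominance check. The sketch below uses placeholder numbers purely to exercise the function; they are not measurements from this project, and the Config fields are assumptions about how results might be recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_gb: float   # lower is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.latency_ms <= b.latency_ms
                and a.memory_gb <= b.memory_gb)
    better = (a.accuracy > b.accuracy or a.latency_ms < b.latency_ms
              or a.memory_gb < b.memory_gb)
    return no_worse and better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Placeholder values for illustration only.
runs = [
    Config("fp16", 0.80, 100.0, 14.0),
    Config("int8", 0.79, 70.0, 7.0),
    Config("int4", 0.74, 60.0, 3.5),
    Config("int4-bad", 0.70, 65.0, 3.6),
]
print([c.name for c in pareto_frontier(runs)])
```

In this toy set, "int4-bad" is dropped because "int4" is at least as good on all three axes; the other three each win on some axis and stay on the frontier.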

Current Direction: Diffusion-Based Language Models

The newest thread under this project moves from autoregressive scaling to diffusion-based language models. Where the Qwen / Gemma work above asks "what does parameter count buy you?", the diffusion work asks "what does a different generative paradigm buy you on the same scale of compute?"

Three concrete fronts are open:

Math reasoning. Whether diffusion's iterative refinement helps multi-step arithmetic and proof-style problems where autoregressive models still struggle with backtracking and self-correction.
Coding capability. Whether diffusion's bidirectional context (it doesn't have to commit left-to-right) translates to better code synthesis and edit-style tasks - infilling, refactoring, and constrained generation in particular.
Improved model alignment. Whether the denoising training objective and inference-time control surface give a cleaner place to inject alignment constraints than RLHF on top of an autoregressive base.

Indic text generation (the original motivation) sits downstream of all three: a model that can reason, code, and stay aligned is a far better candidate for low-resource languages than one that just memorises tokens. This thread is exploratory and active - early results, no claims yet.

Key Learnings & Observations

Indic-V1 (Qwen): Model capacity is a hard boundary for semantic grounding. A 14x parameter increase (0.5B → 7B) transformed the model from a "Parrot" (syntactic mimicry) to a "Student" (semantic understanding). Teacher-student distillation successfully generated high-quality training data from raw corpora, enabling supervised fine-tuning on low-resource languages.

Gemma 3 on Jetson Nano: An 8-bit quantized 1B model with LoRA can be fine-tuned and deployed on 4GB edge hardware. Latency is acceptable for non-real-time applications, enabling privacy-preserving personalization on edge devices without cloud inference.

Ongoing Exploration: The trade-off between model capacity, quantization, and task-specific fine-tuning defines the deployment frontier. Through continued experimentation, the goal is to map out what's actually feasible across the parameter range studied here (0.5B → 7B) and the hardware spectrum (edge → data center), while identifying where semantic competence breaks down.

Reference Papers & Learning Resources

Foundational papers and resources informing these experiments:

Attention Is All You Need - Vaswani et al., 2017
Foundation for all transformer-based LLMs; critical for understanding attention mechanisms in fine-tuning.

Language Models are Few-Shot Learners (GPT-3) - Brown et al., 2020
Demonstrates emergent capabilities at scale; motivates the parameter-scaling hypothesis tested in Indic-V1.

Training Compute-Optimal Large Language Models (Chinchilla) - Hoffmann et al., 2022
Optimal compute budget allocation; informs architectural trade-off decisions in capacity-constrained scenarios.

QLoRA: Efficient Finetuning of Quantized LLMs - Dettmers et al., 2023
Core technique used in both Qwen 7B (Indic-V1) and Gemma 3 1B (Jetson Nano) fine-tuning experiments; enables training on constrained hardware.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations - Lan et al., 2019
Parameter sharing and factorization techniques; informs lightweight architecture design for edge deployment.

Distilling the Knowledge in a Neural Network - Hinton, Vinyals, Dean, 2015
Foundation for teacher-student distillation; directly applied in Indic-V1 data curation pipeline (Qwen 72B → student data).

LoRA: Low-Rank Adaptation of Large Language Models - Hu et al., 2021
Parameter-efficient fine-tuning enabling both Indic-V1 and Jetson Nano experiments; core to all deployment scenarios in this research.

Llama 2: Open Foundation and Fine-Tuned Chat Models - Touvron et al., 2023
Instruction fine-tuning methodology; informs SFT strategy for both Indic language and edge deployment experiments.

Gemma: Open Models Based on Gemini Research and Technology - Gemma Team, Google DeepMind, 2024
Architecture and pre-training background for the Gemma family; Gemma 3 1B is the foundation for the Jetson Nano deployment research.