Mohammed Firdous

Qwen3.5-4B Base Blind Spots is a systematic evaluation framework for probing failure modes in Qwen3.5-4B-Base, a 4B-parameter pre-trained model with Gated DeltaNet attention and a sparse MoE architecture. Rather than collecting anecdotal failures, it runs 84 structured prompts across 12 reasoning categories and produces quantified failure rates with actionable remediation paths.

What It Is

A reproducible LLM failure-mode evaluation framework with:

84-Prompt Test Suite: 5 failure probes and 2 success controls in each of 12 reasoning domains (7 × 12 = 84), including arithmetic, logic, commonsense, coreference, Bayesian reasoning, and Winograd schemas.
Dual-Axis Classification: Each output labelled independently for coherence (intelligible?) and correctness (accurate?), separating format failures from reasoning failures to make remediation more targeted.
Critical dtype Finding: float16 triggers numerical overflow across all 12 categories, producing incoherent output even on trivial controls. bfloat16 is required; float16 results are preserved as a reference.
Fine-Tuning Roadmap: Category-level failure analysis maps each root cause to specific training datasets and scale ranges for targeted improvement.
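
The dual-axis labelling and per-category failure rates described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `LabelledOutput` record, its field names, and the three mode names are hypothetical, and it assumes an incoherent output is counted as a coherence failure regardless of its correctness label.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledOutput:
    """One labelled model output (hypothetical schema, for illustration only)."""
    category: str   # e.g. "arithmetic"
    kind: str       # "probe" or "control"
    coherent: bool  # axis 1: is the output intelligible?
    correct: bool   # axis 2: is the output accurate?


def failure_mode(rec: LabelledOutput) -> str:
    """Collapse the two axes into one mode; coherence is checked first (an assumption)."""
    if not rec.coherent:
        return "incoherent"
    if not rec.correct:
        return "coherent_but_wrong"
    return "pass"


def failure_rates(records):
    """Return, per category, the fraction of outputs that fall into each mode."""
    counts = defaultdict(lambda: {"incoherent": 0, "coherent_but_wrong": 0, "pass": 0})
    for rec in records:
        counts[rec.category][failure_mode(rec)] += 1
    rates = {}
    for category, by_mode in counts.items():
        total = sum(by_mode.values())
        rates[category] = {mode: n / total for mode, n in by_mode.items()}
    return rates


# Toy records, not real results from the evaluation.
records = [
    LabelledOutput("arithmetic", "probe", coherent=True, correct=False),
    LabelledOutput("arithmetic", "probe", coherent=False, correct=False),
    LabelledOutput("arithmetic", "control", coherent=True, correct=True),
    LabelledOutput("arithmetic", "control", coherent=True, correct=True),
]
rates = failure_rates(records)
print(rates["arithmetic"])
```

Keeping "incoherent" and "coherent_but_wrong" as separate modes is what lets a category's failures be routed to different fixes instead of being merged into one error rate.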

How It's Built

Model: Qwen/Qwen3.5-4B-Base via Hugging Face Transformers.
Execution: colab_notebook.ipynb (Google Colab L4/A10G, bfloat16) or modal_runner.py for cloud GPU.
Decoding: Greedy decoding (sampling disabled) for reproducibility.
Data Format: Three JSONL files: prompts.jsonl (84 canonical inputs), train.jsonl (full records with per-dtype outputs and auto-labels), and blind_spots_data.jsonl (labelled results).
Published Artifact: Training dataset released on Hugging Face for reuse and extension.
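
A run along these lines might look like the sketch below. It is an assumption-laden outline rather than the contents of colab_notebook.ipynb or modal_runner.py: the `prompt` field name, the `max_new_tokens` value, and the `load_prompts` and `run_suite` helpers are all hypothetical. The Transformers calls used (`AutoTokenizer.from_pretrained`, `AutoModelForCausalLM.from_pretrained`, `generate` with `do_sample=False`) are standard, though recent library versions prefer the name `dtype` over `torch_dtype`.

```python
import json
from pathlib import Path

MODEL_ID = "Qwen/Qwen3.5-4B-Base"

# do_sample=False selects greedy decoding, so repeated runs yield the same tokens.
# max_new_tokens is an assumed value, not taken from the project.
GENERATION_KWARGS = {"do_sample": False, "max_new_tokens": 128}


def load_prompts(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def run_suite(prompts_path="prompts.jsonl", prompt_field="prompt"):
    """Generate one completion per prompt in bfloat16 (field name is hypothetical)."""
    # Heavy imports are deferred so the helpers above work without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.to(device)
    model.eval()

    results = []
    for record in load_prompts(prompts_path):
        inputs = tokenizer(record[prompt_field], return_tensors="pt").to(device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, **GENERATION_KWARGS)
        # Keep only the newly generated tokens, not the echoed prompt.
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        results.append({**record, "output": text})
    return results
```

Calling `run_suite()` downloads the model weights and needs a GPU with bfloat16 support to finish in reasonable time; `load_prompts` can be used on its own to inspect the prompt file.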

What I Learned

dtype Is a Hard Correctness Requirement: float16 overflow is not subtle degradation; it invalidates all results. Dtype selection must be verified before any evaluation is trusted.
Structured Probing Over Anecdote: Controlled failure/success pairs per category turn vague quality concerns into measurable, comparable failure rates.
Dual-Axis Taxonomy Drives Different Fixes: Coherence failures point to inference config or instruction tuning; correctness failures point to chain-of-thought or domain-specific data. Conflating them wastes remediation effort.
Small Benchmarks Can Be High Signal: 84 prompts across 12 categories surface recurring blind spots faster than large-scale evaluation suites when the goal is targeted fine-tuning.
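
The range gap behind the dtype lesson can be illustrated with the standard library alone. The sketch below does not reproduce the model's activations or the exact point at which overflow occurred in the evaluation; it only shows that float16 cannot represent values much beyond 65504, while bfloat16 keeps float32's exponent range at reduced precision. The helper names are hypothetical, and bfloat16 is emulated here by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest instead).

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_float16(x: float) -> bool:
    """True if x can be packed as a finite float16 (struct format 'e')."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


for value in (1000.0, FLOAT16_MAX, 70000.0, 1e30):
    print(value, "fits float16:", fits_float16(value), "| as bfloat16:", to_bfloat16(value))
```

In this emulation 70000.0 does not fit in float16 but survives in bfloat16 as 69632.0: the magnitude is preserved and only low-order precision is lost, which is consistent with bfloat16 being the safer choice when intermediate values grow large.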