Glossary
This page defines project terms as they are used in the pccx public documentation. It is intentionally conservative: planned work, throughput targets, and board measurements are labelled as such.
Project And Release Lines
- pccx
Parallel Compute Core eXecutor. A hardware-software co-design project for NPU architectures targeting edge inference workloads.
- v001
Archived experimental pccx architecture line. It remains in the docs as historical context and should not be treated as the active RTL target.
- v002
Active KV260 LLM architecture line. In this docs site, v002 usually means the public architecture, ISA, driver, RTL-reference, and verification pages for the current pccx-FPGA-NPU-LLM-kv260 line.
- v002.0
Baseline v002 integration line on KV260. Throughput language for this line is measured-only until release evidence is published.
- v002.1
Planned continuation of v002 on the same RTL repository. The roadmap scopes sparsity and speculative-decoding work to this line. The 20 tok/s number is a target for this line, not a reported board result.
- v003.x
Planned LLM continuation in a separate RTL repository. Public documentation treats v003 as a future line until its repository and release branches are stabilized.
- vision-v001
Parallel CNN inference track that reuses the KV260 substrate but targets vision workloads rather than autoregressive LLM decoding.
- pccx-lab
Companion verification and profiling environment for pccx traces, reports, and workflow automation. Public claims derived from lab output still need the release evidence gates described in the roadmap.
- pccx-llm-launcher
Companion launcher repository for model preparation, runtime contracts, and KV260-facing orchestration. Current public launcher pages describe scaffold, mock, and contract surfaces unless they cite board evidence.
Hardware Target
- KV260
Xilinx Kria KV260 Starter Kit, based on the Zynq UltraScale+ ZU5EV device. It is the primary board target for v002 public documentation.
- kv260
Lowercase slug used in repository names, branch names, build directories, or scripts when a filesystem-safe target identifier is needed.
- Zynq UltraScale+
AMD/Xilinx SoC family that combines a Processing System and Programmable Logic fabric. The KV260 target uses a ZU5EV part.
- PS
Processing System. The Arm-based host side of the Zynq device.
- PL
Programmable Logic. The FPGA fabric side where the pccx NPU RTL is implemented.
- AXI
Arm AMBA interconnect protocol family used for host, memory, and streaming interfaces in the design.
- AXI-HP
High-Performance AXI ports from the PS to PL. In v002 documentation these ports are used for high-bandwidth weight traffic into the NPU.
- ACP
Accelerator Coherency Port. In pccx docs, ACP refers to the coherent path used for activation/result traffic between host memory and the accelerator.
- DSP48E2
Xilinx DSP slice available in UltraScale+ devices. pccx v002 uses DSP48E2 packing for the W4A8 GEMM datapath.
- BRAM
Block RAM in the FPGA fabric. pccx uses BRAM for smaller local buffers and per-core storage structures.
- URAM
UltraRAM in the FPGA fabric. pccx v002 uses URAM for the shared L2 cache and weight buffering structures described in the architecture docs.
- CDC
Clock-domain crossing. Used where data moves between the AXI/control clock domain and the core compute clock domain.
- Vivado block design
Xilinx Vivado IP-integrator design graph. In the v002.1 docs, a block-design scaffold is build setup material, not proof that implementation or timing has completed.
- bitstream
FPGA configuration artifact produced after synthesis and implementation. Public pccx docs should call a bitstream deployable only when the matching evidence page or release checklist links the build, timing, and board artifacts.
- SD staging
Packaging step that prepares files for booting or testing the KV260 from SD media. It is a deploy-preparation step and does not by itself establish a hardware run.
Data Types And Numeric Formats
- W4A8
Weight-4, Activation-8 quantization. In pccx v002 this means INT4 weights multiplied by INT8 activations on the main integer compute path.
- W4A8KV4
Shorthand used for an evidence-gated Gemma 3N E4B target configuration: W4A8 compute with 4-bit KV-cache storage. Treat it as a target configuration label unless a page cites measured evidence.
- INT4
Signed 4-bit integer value, used for quantized weights in the W4A8 path.
- INT8
Signed 8-bit integer value, used for quantized activations in the W4A8 path.
- BF16
Brain floating point format with an 8-bit exponent and 7-bit mantissa. pccx docs use BF16 for activation, KV-cache, or SFU paths where integer-only arithmetic is not the intended representation.
- FP32
IEEE single-precision floating point. Public docs mention FP32 only where the operation needs a higher-precision software or SFU-side representation.
- Precision promotion
Conversion from the integer compute path to BF16 or FP32 for non-linear or numerically sensitive operations such as softmax, RMSNorm, GELU, and RoPE.
- Sign recovery
The correction step used when signed low-bit operands are packed into a wider multiply datapath. In pccx docs the term is tied to W4A8 DSP packing, not to model-level accuracy claims.
- Activation quantization
Policy for converting activation values into the representation consumed by the integer datapath. The v002.1 decision page names the default policy but does not claim final model accuracy.
- e_max
Maximum-exponent summary used by the v002.1 activation-scale policy. Public docs describe it as a scale-selection mechanism, not as measured accuracy or throughput evidence.
- BFP
Block floating point. In the v002.1 activation policy, BFP refers to a shared power-of-two activation scale for a block of values.
- symmetric INT8
Reviewed activation-scale mode that uses symmetric signed INT8 quantization. The design-decision page keeps it as a mode under review rather than the v002.1 default.
- constant-cache scale
Driver-provided activation-scale table or constant path. It remains a reviewed mode until the hardware/software interface and tests make it the chosen default.
- ACT_SCALE_POLICY
Public parameter handle for the v002.1 activation scaling policy.
- ACT_SCALE_EMAX_BFP
Default v002.1 activation-scale mode named by the design-decision page: e_max plus BFP power-of-two scaling.
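The BFP and e_max entries above describe a shared power-of-two activation scale chosen from a block's maximum exponent. The Python sketch below illustrates that idea only. The names `bfp_quantize_block` and `bfp_dequantize_block`, the 8-bit signed integer width, and the rounding and clamping behaviour are assumptions made for illustration; they are not taken from the v002.1 design-decision page or the RTL.

```python
import math


def bfp_quantize_block(values, int_bits=8):
    """Quantize a block of floats to signed integers sharing one power-of-two scale.

    Hypothetical helper: returns (quantized_values, shift) where each original
    value is approximated by q * 2**shift.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0

    # math.frexp gives max_abs = m * 2**e_max with 0.5 <= m < 1,
    # so e_max summarizes the largest exponent in the block.
    _, e_max = math.frexp(max_abs)

    # Pick the shared exponent so the largest value lands near the top
    # of the signed integer range.
    shift = e_max - (int_bits - 1)
    scale = 2.0 ** shift

    q_min = -(2 ** (int_bits - 1))
    q_max = 2 ** (int_bits - 1) - 1
    # Python's round() rounds halves to even; clamp to the signed range.
    quantized = [max(q_min, min(q_max, round(v / scale))) for v in values]
    return quantized, shift


def bfp_dequantize_block(quantized, shift):
    """Reconstruct approximate floats from the shared-exponent integers."""
    return [q * (2.0 ** shift) for q in quantized]
```

For the block `[0.5, -0.25, 1.0, 0.0]` this sketch selects a shift of -6 and reconstructs the inputs exactly. Blocks with a wider dynamic range lose low-order bits of their smaller values, which is the usual trade-off of a shared block exponent.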
Compute Blocks
- GEMM
General Matrix-Matrix Multiply. In v002 it is the matrix core used mainly for prefill and other matrix-heavy work. The architecture docs describe a 32 x 32 systolic array for the KV260 configuration.
- GEMV
General Matrix-Vector Multiply. In v002 it is the vector core used for decode-dominant work where a new token repeatedly multiplies an activation vector by streamed weights.
- CVO
Complex Vector Op. ISA opcode family for non-linear vector operations and reductions that execute on the SFU path.
- SFU
Special Function Unit. The backend that executes CVO operations such as exp, sqrt, GELU, sin/cos, reduce-sum, scale, and reciprocal.
- PE
Processing Element. A compute cell in the systolic array or related datapath.
- Systolic array
Regular grid of PEs that moves operands through a fixed pattern. In pccx v002 public docs, this term usually refers to the GEMM array.
- Weight Stationary
GEMM dataflow where a weight tile is loaded into the array and reused across many activation steps.
- Weight Streaming
GEMV dataflow where weights stream through the vector datapath because each weight is used once for the current token step.
- LUT
Lookup table. In the FPGA sense, LUTs are logic resources. In the algorithmic sense, pccx docs also use lookup tables for some dequantization or SFU helper paths; read the local context.
- CORDIC
Iterative coordinate-rotation method used for selected transcendental functions. pccx docs mention CORDIC as part of the SFU implementation path.
- K-split
Division of the reduction dimension into chunks. v002.1 docs discuss it with drain cadence and accumulator bounds, not as a completed scheduler claim.
- drain cadence
Frequency at which partial accumulators are drained from a K-split path. The current v002.1 default is parameterized rather than hardwired into a public performance claim.
- K_DRAIN_LIMIT
Public parameter handle for the v002.1 K-split accumulator drain limit. The documented default is 1024.
- DSP accounting baseline
Convention for reporting intended compute-core DSP usage separately from implementation extras. Actual utilization still comes from synthesis reports.
- DSP_BASELINE_GEMM
GEMM compute-core DSP baseline parameter. The v002.1 decision page sets it to 1024 for the 32 x 32 PE grid.
- DSP_BASELINE_GEMV
GEMV compute-core DSP baseline parameter. The v002.1 decision page sets it to 64 for four 16-DSP vector lanes.
- DSP_BASELINE_ALPHA
Accounting bucket for implementation extras outside the GEMM/GEMV baseline.
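The K-split, drain cadence, and K_DRAIN_LIMIT entries above relate chunking of the reduction dimension to accumulator bounds. The Python sketch below shows one way to read that relationship. It assumes that the drain limit counts multiply-accumulate steps per partial accumulator, which this glossary does not state, and the function names are hypothetical. The operand ranges come from the signed INT4 and INT8 definitions in this glossary.

```python
K_DRAIN_LIMIT = 1024  # documented v002.1 default; its exact semantics are assumed here


def k_split_dot(weights, activations, drain_limit=K_DRAIN_LIMIT):
    """Dot product that drains a partial accumulator every drain_limit steps.

    Returns (total, number_of_drains).
    """
    if len(weights) != len(activations):
        raise ValueError("operand lengths differ")
    total = 0
    drains = 0
    for start in range(0, len(weights), drain_limit):
        partial = 0  # models the narrow on-array accumulator
        chunk_w = weights[start:start + drain_limit]
        chunk_a = activations[start:start + drain_limit]
        for w, a in zip(chunk_w, chunk_a):
            partial += w * a
        total += partial  # models the drain into a wider accumulator
        drains += 1
    return total, drains


def worst_case_partial(drain_limit=K_DRAIN_LIMIT):
    """Largest partial-sum magnitude for signed INT4 x signed INT8 operands."""
    max_abs_w = 8    # signed INT4 range is -8..7
    max_abs_a = 128  # signed INT8 range is -128..127
    return drain_limit * max_abs_w * max_abs_a


def signed_bits_needed(magnitude):
    """Smallest two's-complement width that can hold both +magnitude and -magnitude."""
    return magnitude.bit_length() + 1
```

Under these assumptions a limit of 1024 bounds each partial sum by 1024 x 1024 = 1,048,576, which fits a 22-bit signed accumulator. The real accumulator width and drain behaviour are defined by the RTL and the v002.1 decision page, not by this sketch.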
ISA And Runtime Terms
- ISA
Instruction Set Architecture. pccx v002 uses a custom fixed-width 64-bit ISA for compute, memory, and CVO instructions.
- VLIW
Very Long Instruction Word. In pccx docs this describes the fixed-width instruction format and explicit fields used by the NPU dispatcher.
- opcode
Operation-code field in an instruction. The v002 ISA pages are the source of truth for opcode values and instruction field layouts.
- GEMM instruction
v002 ISA compute instruction that dispatches matrix-matrix work to the GEMM backend.
- GEMV instruction
v002 ISA compute instruction that dispatches matrix-vector work to the GEMV backend.
- MEMCPY instruction
v002 ISA memory movement instruction. See the ISA reference for supported source and destination paths.
- MEMSET instruction
v002 ISA instruction used to write shape or constant-table state rather than to run arithmetic.
- CVO instruction
v002 ISA instruction that dispatches an SFU function over a vector or reduction operand.
- HAL
Hardware Abstraction Layer. The C/C++ driver layer that wraps register, memory, and instruction-dispatch details for host software.
- Sail
ISA-specification language used by the pccx formal model. In pccx docs, Sail models are used to check instruction semantics and field widths against the intended ISA structure.
- launcher contract
Data-only interface between the planned KV260 runtime path and launcher software. A contract page describes shapes and guardrails; it is not board execution evidence.
- readiness scaffold
Typed placeholder or adapter surface that makes a future hardware path reviewable before device access is implemented.
- AXI command/status shapes
Launcher-side data structures for command and status exchange over the future KV260 boundary. Shape validation is contract evidence, not a live MMIO run.
- result streaming
Runtime path for returning generated tokens or accelerator results. Public docs should distinguish mock streams, serial test framing, and captured board streams.
- serial TTY
Character-device path used by launcher or lab tooling to exchange framed records with a connected target. Tests that skip without a device are not board evidence.
- TraceStream
pccx-lab iterator contract for trace records. File replay and serial TTY sources can share this surface while still having different evidence status.
- KVFPGA_TTY
Environment or configuration path naming the serial device used by the KV260 trace source.
- newline JSON framing
Trace framing style where one JSON payload is carried per line between begin/end markers.
- CRC
Cyclic redundancy check. In pccx-lab trace framing docs it is used to detect corrupted payloads; skipped bad frames should not be counted as valid hardware evidence.
- sequence gap
Missing trace-frame sequence number reported by the lab pipeline. It is a diagnostic signal that the captured stream may be incomplete.
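The newline JSON framing, CRC, and sequence gap entries above can be tied together with a small reader. The Python sketch below is illustrative only: the marker strings, the record fields `seq`, `crc`, and `body`, and the use of CRC-32 from `zlib` are assumptions, not the pccx-lab wire format. It shows the behaviour the glossary describes: corrupted frames are skipped rather than counted, and missing sequence numbers are reported.

```python
import json
import zlib

BEGIN_MARKER = "#BEGIN"  # hypothetical marker text
END_MARKER = "#END"      # hypothetical marker text


def encode_frame(seq, payload):
    """Serialize one payload as a single JSON line carrying a CRC of its body."""
    body = json.dumps(payload, sort_keys=True)
    crc = zlib.crc32(body.encode("utf-8"))
    return json.dumps({"seq": seq, "crc": crc, "body": body})


def read_frames(lines):
    """Return (payloads, bad_frame_count, sequence_gaps) for a framed stream."""
    payloads, bad, gaps = [], 0, []
    inside = False
    last_seq = None
    for raw in lines:
        line = raw.strip()
        if line == BEGIN_MARKER:
            inside = True
            continue
        if line == END_MARKER:
            inside = False
            continue
        if not inside or not line:
            continue
        try:
            record = json.loads(line)
            body = record["body"]
            ok = zlib.crc32(body.encode("utf-8")) == record["crc"]
            seq = record["seq"]
            payload = json.loads(body)
        except (ValueError, KeyError, TypeError, AttributeError):
            ok = False
        if not ok:
            bad += 1  # skipped frames are never returned as valid data
            continue
        if last_seq is not None and seq != last_seq + 1:
            gaps.append((last_seq, seq))  # diagnostic: stream may be incomplete
        last_seq = seq
        payloads.append(payload)
    return payloads, bad, gaps
```

In this sketch a skipped bad frame also appears as a sequence gap, because the next valid frame no longer follows the last accepted one. That matches the glossary's reading of a gap as a sign that the captured stream may be incomplete.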
Memory And Model Terms
- L1
Local per-core memory or buffer close to a compute backend.
- L2
Shared on-chip cache in the v002 architecture. It is backed by URAM and is shared by GEMM, GEMV, SFU, and memory-dispatch paths.
- Weight Buffer
On-chip FIFO/buffer path for model weights arriving from external memory. GEMM uses it for weight preload/reuse; GEMV uses it for streaming.
- KV cache
Attention key/value storage retained across autoregressive decoding steps. pccx docs distinguish KV-cache design targets from measured board capacity or throughput claims.
- Attention Sink
KV-cache policy term for retaining the first tokens of a prompt while using a sliding local window for recent tokens.
- Local Window
KV-cache policy term for the recent-token region retained during long-context decoding.
- RoPE
Rotary Position Embedding. pccx maps RoPE-related sine and cosine work to CVO operations in the SFU path.
- RMSNorm
Root Mean Square Layer Normalization. In pccx docs this is one of the non-linear or reduction-heavy operations associated with the SFU path.
- Softmax
Normalization used in attention. pccx docs map its exponential, reduction, reciprocal, and scale steps to CVO/SFU operations.
- GELU
Gaussian Error Linear Unit activation. pccx docs map GELU to the CVO/SFU path.
- Gemma 3N E4B
Target LLM family named in the v002 public docs. Claims about token rate or board execution remain evidence-gated unless the page cites published verification data.
- GemmaArchSpec
Launcher-side configuration object for Gemma shape metadata and packed-size checks. It is a spec-validation surface, not a model execution claim.
- W4 prep
Launcher-side preparation of signed W4 packed weights and related metadata. Current docs treat it as a software contract until hardware handoff evidence lands.
- manifest metadata
Structured metadata that records prepared weight shapes, scales, packed sizes, or related handoff fields for the launcher path.
- tokenizer contract
Offline tokenizer interface used by the launcher scaffold. Placeholder fixtures do not claim real Gemma tokenizer assets.
- token streaming
Movement of prompt or generated-token data across a runtime boundary. In the current software-path docs, serial and mock streaming are scaffold evidence until board captures are published.
- marker-wrapped chunks
Token-transport records delimited by explicit markers, sometimes with length prefixes. They define framing behavior rather than hardware throughput.
- mock orchestration
End-to-end software path that joins prompt encode, W4 prep, mock command polling, output receive, and decode without a real board run.
- AltUp
Gemma-specific multi-stream state item named in v002.1 FAQ material. Its effect on throughput or memory pressure still needs measured evidence before public claims.
- LAuReL
Gemma-specific mechanism named in model and FAQ pages. Public docs may describe the mapping, but speedup or accuracy claims need evidence.
- PLE
Per-Layer Embedding mechanism referenced by Gemma model docs. Treat PLE-related scheduling text as design mapping unless an evidence page links a measurement.
- grouped-query attention
Attention variant that shares key/value projections across query groups. pccx docs discuss it as part of the Gemma mapping and KV-cache traffic budget.
- cross-layer KV sharing
Gemma-specific KV reuse pattern that affects cache residency and traffic. Public docs should keep it separate from measured throughput claims.
- EAGLE-3
Speculative-decoding technique named in the v002.1 roadmap scope. In this repo it is planned work, not a completed v002.0 feature.
- SSD
Speculative-decoding roadmap item in the v002.1 scope. Expand or redefine the acronym at the point of use when adding detailed public documentation.
- J Tree
Roadmap shorthand associated with the v002.1 speculative-decoding stack. Treat it as planned scope until a design page defines and verifies it.
- G sparsity
Roadmap lane for v002.1 sparsity work. It should be described as ramp scope until implementation and evidence pages say more.
- H/H+
Roadmap shorthand for EAGLE-3 speculative-decoding phases in the v002.1 ramp.
- I SSD
Roadmap shorthand for the SSD phase in the v002.1 ramp.
- K benchmark
Roadmap shorthand for benchmark/evidence work after the v002.1 mechanism lanes. Benchmarks become public claims only through the evidence gates.
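The Attention Sink and Local Window entries above describe a KV-cache retention policy: keep the first tokens of the prompt plus a sliding window of recent tokens. The Python sketch below shows which token positions such a policy would keep. The function name and the default sizes are hypothetical and are not pccx configuration values.

```python
def retained_positions(seq_len, sink_tokens=4, local_window=8):
    """Return the sorted token positions kept by a sink-plus-window policy.

    sink_tokens: number of leading prompt positions always retained.
    local_window: number of most recent positions retained.
    """
    if seq_len < 0 or sink_tokens < 0 or local_window < 0:
        raise ValueError("sizes must be non-negative")
    sink = range(0, min(sink_tokens, seq_len))
    window = range(max(seq_len - local_window, 0), seq_len)
    # The two regions overlap for short sequences, so merge them as a set.
    return sorted(set(sink) | set(window))
```

With 20 tokens, a 4-token sink, and an 8-token window, positions 0 to 3 and 12 to 19 are kept, so the retained count stays at 12 however long the sequence grows. This describes retention only; it says nothing about measured board capacity or throughput.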
Metrics And Evidence
- tok/s
Tokens per second. pccx uses this as the primary user-visible decoding throughput unit.
- TT
Throughput target. This is planning shorthand for a target token rate, not a measurement. Public pages should prefer spelling out “throughput target” on first use.
- measured-only
Documentation posture for the v002.0 release line: do not quote throughput, timing closure, or board-run claims until the evidence checklist admits those measurements.
- bring-up
Hardware integration phase where the bitstream, board setup, host driver, and smoke tests are made to run together. Bring-up logs are evidence inputs, not automatically release claims.
- release evidence
Checklist-gated artifacts used to decide whether timing, throughput, or board-execution statements are allowed in public docs.
- evidence inventory
Public list of measured, reproducible artifacts and pending gates. It is the place to check whether a value is measured, pending, or only a target.
- claim guard
Review rule or scan that prevents public docs from turning targets, scaffolds, mocks, or pending gates into completed hardware claims.
- pre-flight
Preparatory state for build, launcher, or deploy work before the full command sequence has been run and evidence has been captured.
- smoke capture
Small board or tool run used to collect initial logs. It can support bring-up evidence, but it does not replace release evidence for timing or throughput.
- timing report
Vivado report used to justify timing wording. A docs page should not claim timing closure without a linked report or release evidence entry.
- utilization report
Vivado report used to justify FPGA resource wording such as DSP, LUT, BRAM, or URAM counts.
- throughput target
Planned token-rate goal. It must remain distinct from measured throughput in public wording.
- board run
Execution against a connected KV260 or other named target board. Mock tests, type checks, and local software orchestration are not board runs.
- trace replay
Analysis of an existing .pccx trace file through pccx-lab tooling. Replay can validate analysis paths without proving new hardware execution.
Documentation And Release Terms
- spec resolution
Reader step that separates architecture intent, model mapping, ISA source of truth, and measured evidence before quoting a claim.
- runbook
Step-by-step command record for a build, local docs check, deploy, or hardware procedure. A runbook is procedure evidence only after the commands and results are captured.
- deploy runbook
Documentation path for publishing the Sphinx site through GitHub Pages. A deploy check proves publication, not hardware performance.
- release status
Label such as draft, prerelease, latest release, or archived release used by release notes. It should not be overloaded with hardware readiness.
- pre-release
GitHub Release state for work that is published before being treated as a final release.
- validation status
Release-note field that records which checks passed, failed, or were not run. It should name commands or CI runs where useful.
- known limitations
Release-note section for caveats, missing evidence, or deferred capability.
- release checklist
Maintainer checklist for release hygiene. For pccx ISA PDF changes, the checklist includes rebuilding the PDF from main.tex.
- GitHub Pages deploy
Publication workflow for the documentation site. Passing deploy does not convert a target, mock, or pending gate into measured evidence.
- contributors acknowledgement
Public recognition of people who contribute documentation, reviews, bug reports, diagrams, examples, or related code after maintainers accept the entry for publication.
- news section
Placeholder area for future project updates, release announcements, and community news. It should not carry release claims without the same evidence gates as the rest of the docs.