How we at IOAS designed specialised language models for the Slovak gastro segment. Three models fine-tuned for the Slovak accounting context (SK-Invoice-Extract, SK-Posting-Classifier, SK-Legal-RAG) achieve F1 0.961 for field extraction, 0.887 for posting prediction and 0.924 retrieval F1 — at 100% on-premise inference.
Abstract
Cloud large language models (LLMs) have brought a fundamental shift in document workflow automation over the past two years. Yet in regulated sectors such as accounting and tax compliance they hit three systemic barriers: regulatory (GDPR, AI Act, tax secrecy), economic (per-document price at high volumes) and qualitative (poor accuracy on localised documents of small markets). In this paper we present a case study from IOAS's partnership on the development of GastroPlay.sk, a Slovak cloud-native operating system for the gastronomy segment, where we deployed three specialised language models fine-tuned for the Slovak accounting context: SK-Invoice-Extract (invoice extraction), SK-Posting-Classifier (double-entry posting prediction) and SK-Legal-RAG (legal search assistant). On a curated golden dataset of 4,200 annotated Slovak invoices and 12,800 accounting transactions we achieve F1 scores of 0.961 for field extraction, 0.887 for posting suggestions and 0.924 retrieval F1 for legal queries against Slovak law — at 100% on-premise inference with zero data exfiltration to the cloud. Median latency for OCR + extraction is 3.8 s, which is 2.1× faster than the baseline through public cloud APIs.
1. Introduction
The Slovak SME market in the gastronomy segment comprises approximately 14,600 active venues [1] with a typical profile: one venue, 1 – 15 employees, annual revenue under €500,000. These businesses process 30 – 200 incoming invoices weekly, issue 5 – 80 outgoing invoices and submit 20+ tax and statistical filings per year. In our observations of reference clients, the manual work in this cycle (scanning, posting, payment matching, VAT Control Statement filings) accounts for 8 – 14 hours of administrative effort per month, time that adds no business value.
Cloud LLMs (GPT-4 class, Claude Sonnet, Gemini Pro) achieve “few-shot” invoice extraction accuracy in the range 0.82 – 0.93 F1 [2] with proper prompt engineering. In the context of Slovak accounting, however, we encountered four specific issues:
- Slovak diacritics and morphology — inflection of Slovak vendor names means that cloud LLMs without explicit normalisation cannot reliably match “Foodservice Plus s.r.o.” on an invoice against the Slovak Business Register entry.
- Output format variability — Slovak invoices have no dominant template (unlike Germany’s ZUGFeRD ecosystem); they are often generated by various small ERP tools (iKros, MRP, Money S3, custom Excel sheets) with ~120 unique layouts in our test sample.
- Accounting posting — mapping invoice line text to the chart-of-accounts under MF SR Decree no. 23054/2002-92 requires domain knowledge outside the distribution of general cloud LLMs.
- Regulatory constraints — tax secrecy under § 11 of Act no. 563/2009 Coll. and Article 9 GDPR for certain PII in payroll filings prohibit or substantially complicate moving data to third-party cloud services outside EU jurisdiction.
These barriers motivated the design of locally deployed, domain-specialised models that we at IOAS designed, trained and integrated into the GastroPlay.sk product.
2. Related work
The trend toward “vertically specialised” small language models (SLMs) gained strong research and commercial traction in 2024 – 2026. Microsoft Phi-3 (3.8B params) [3] and Mistral 7B [4] demonstrated that models orders of magnitude smaller than frontier LLMs can, with appropriate domain fine-tuning, outperform large generalist models on specific tasks. In the financial document context, FinBERT [5], FinGPT [6] and DocFin [7] were published, but all were trained primarily on English corporate filings (10-K, 10-Q SEC reports) which have minimal overlap with European SME invoicing.
For the Slovak language, pre-trained encoders Slovak-BERT [8] and SlovakRoBERTa [9] exist, but at the time of our work (Q3 2025) no publicly available generative model specialised for Slovak accounting and tax language existed. Our approach therefore started from open multilingual base models (Llama 3.1 8B Instruct [10], Mistral 7B v0.3 [4]) and applied parameter-efficient fine-tuning (PEFT) techniques LoRA [11] and QLoRA [12].
For document extraction we built on LayoutLMv3 [13] and Donut [14]; for retrieval-augmented generation (RAG) we adapted the multilingual bge-m3 [15] encoder.
3. System architecture
The system consists of a preprocessing stage (Layer 0) and four hierarchical model layers (Fig. 1), designed so that each successive layer operates on increasingly structured data and each can be scaled horizontally according to load:
                 ┌──────────────────────────────────────┐
Mobile / Web  →  │ Layer 0: Image preprocessing         │
                 │ (deskew, denoise, perspective)       │
                 └────────────────┬─────────────────────┘
                                  ▼
                 ┌──────────────────────────────────────┐
                 │ Layer 1: OCR + layout                │
                 │ Tesseract 5 + LayoutLMv3 (local)     │
                 └────────────────┬─────────────────────┘
                                  ▼
                 ┌──────────────────────────────────────┐
                 │ Layer 2: Entity extraction           │
                 │ SK-Invoice-Extract (Llama 3.1 8B     │
                 │   + LoRA, fine-tuned)                │
                 └────────────────┬─────────────────────┘
                                  ▼
                 ┌──────────────────────────────────────┐
                 │ Layer 3: Classification & posting    │
                 │ SK-Posting-Classifier                │
                 │ (Mistral 7B + QLoRA)                 │
                 └────────────────┬─────────────────────┘
                                  ▼
                 ┌──────────────────────────────────────┐
                 │ Layer 4: Legal assistant             │
                 │ SK-Legal-RAG (bge-m3 + Llama 3.1)    │
                 │ for queries into Slovak law          │
                 └──────────────────────────────────────┘
Fig. 1. Four-layer architecture. Layers 1 – 4 run on IOAS infrastructure in the EU region (Frankfurt) on Kubernetes clusters with NVIDIA L40S GPUs. No client data ever leaves EU legal jurisdiction.
3.1 Layer 0 – Preprocessing
The GastroPlay mobile app captures invoice photos, typically under uneven lighting and with perspective distortion. We apply:
- Edge detection via Apple VisionKit (iOS) / Google ML Kit (Android), running entirely on the device.
- Deskew and perspective transformation via OpenCV (server-side).
- Adaptive thresholding (Sauvola binarisation) for low-contrast PDFs.
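The Sauvola step can be illustrated with a minimal pure-Python sketch of the formula T = m · (1 + k·(s/R − 1)); production uses an optimised OpenCV pipeline, and the window, k and R values here are illustrative defaults, not our tuned parameters:

```python
import math

def sauvola_threshold(img, window=3, k=0.2, R=128):
    """Binarise a 2D grid of grayscale values (0..255) with the
    Sauvola rule: a pixel is foreground (0) if it falls below the
    locally adaptive threshold m * (1 + k * (s / R - 1)), where m and
    s are the mean and std-dev of the local window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    r = window // 2
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - r), min(h, y + r + 1))
                    for xx in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(vals) / len(vals)
            s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))
            t = m * (1 + k * (s / R - 1))
            out[y][x] = 255 if img[y][x] > t else 0
    return out
```

The adaptive threshold is what lets low-contrast scans binarise cleanly where a single global threshold would wipe out faint print.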
3.2 Layer 1 – OCR + layout
For OCR we use Tesseract 5.4 with Slovak training data extended by ~3,200 annotated regions from Slovak invoices (numbers, IČO, sums, IBAN). Geometric structure (text bounding boxes, table edges) is extracted via LayoutLMv3 fine-tuned on 1,800 annotated pages.
Hybrid OCR (Tesseract + LayoutLMv3) achieves 97.8% character accuracy on our evaluation dataset vs. 94.2% for Tesseract alone on Slovak invoices with poor printing.
3.3 Layer 2 – SK-Invoice-Extract
The main extraction model. Input: text representation of the invoice with spatial markers (<box x=120 y=300>Foodservice Plus s.r.o.</box>). Output: JSON conforming to the European EN 16931 standard [16].
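The spatial markup can be produced from OCR output with a simple serialiser; the dict layout below is a simplified stand-in for the real Tesseract/LayoutLMv3 structures, so treat it as a sketch of the convention rather than our internal format:

```python
def to_box_markup(words):
    """Serialise OCR words (text plus top-left bounding-box
    coordinates) into the <box x=.. y=..>..</box> markup consumed by
    SK-Invoice-Extract. `words` is a list of dicts with keys
    "text", "x", "y" -- a simplified stand-in for real OCR output."""
    return "\n".join(
        f'<box x={w["x"]} y={w["y"]}>{w["text"]}</box>' for w in words
    )
```

Keeping coordinates in the text stream is what lets a text-only LLM reason about layout (e.g. which amount sits in the totals corner) without a vision encoder.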
Architecture:
- Base model: Llama 3.1 8B Instruct
- Fine-tuning: LoRA with rank=32, alpha=64; target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (full configuration in §4.2)
- Training dataset: 4,200 annotated Slovak invoices (3,360 train / 420 validation / 420 test)
- Training infrastructure: 4× NVIDIA H100 SXM, ZeRO-3, 12 hours, batch_size=8
- Loss: standard cross-entropy with weighting for output JSON tokens
Output JSON schema:
{
  "vendor": {
    "name": "string",
    "ico": "string (8 digits, mod-11 valid)",
    "ic_dph": "string (SK + 10 digits)",
    "address": "string"
  },
  "invoice_number": "string",
  "issue_date": "ISO 8601",
  "delivery_date": "ISO 8601",
  "due_date": "ISO 8601",
  "totals": {
    "base_23": "decimal",
    "vat_23": "decimal",
    "base_19": "decimal",
    "vat_19": "decimal",
    "base_5": "decimal",
    "vat_5": "decimal",
    "exempt": "decimal",
    "total": "decimal"
  },
  "iban": "string (SK + 22 digits)",
  "lines": [
    {
      "description": "string",
      "quantity": "decimal",
      "unit": "string (UN/ECE Rec 20)",
      "unit_price": "decimal",
      "vat_rate": "integer (0|5|19|23)"
    }
  ],
  "confidence": "decimal [0,1]"
}
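The "mod-11 valid" constraint on the IČO field is the commonly documented Slovak/Czech checksum: weights 8…2 over the first seven digits, with remainders 0 and 1 mapping to check digits 1 and 0. A minimal sketch of the rule (the function name is ours):

```python
def ico_is_valid(ico: str) -> bool:
    """Mod-11 checksum for an 8-digit Slovak ICO: weight the first
    seven digits by 8..2, take the sum mod 11, and compare
    (11 - remainder) % 10 against the eighth (check) digit."""
    if len(ico) != 8 or not ico.isdigit():
        return False
    weighted = sum(int(d) * w for d, w in zip(ico[:7], range(8, 1, -1)))
    return int(ico[7]) == (11 - weighted % 11) % 10
```

The Tesco Stores SR IČO 31321828 quoted later in the article passes this check, which is what makes the post-processing row in the §5.3 ablation a cheap, deterministic filter for OCR digit errors.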
3.4 Layer 3 – SK-Posting-Classifier
The second specialised model addresses the task we identified as the most expensive in human time: assigning a chart-of-accounts entry to each invoice line. A classical rule-based approach (vendor → account dictionary) fails for new vendors and for nuances like “same vendor, different purchase intent” (e.g. from a hypermarket one buys raw materials for 501, office supplies for 501.002, and entertainment for 513).
The model is a fine-tuned variant of Mistral 7B v0.3 via QLoRA (4-bit quantisation, NF4) with 12,800 training examples in the format:
<|context|>
Vendor: Tesco Stores SR, IČO 31321828
Item: "Semi-skimmed milk 1L × 12"
Accounting context: gastro restaurant, micro entity
<|target|>
{"account": "501.001", "cost_center": "kitchen", "rationale": "material consumption - raw materials"}
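Assembling this prompt from extracted fields is mechanical; the helper below is a hypothetical sketch of the convention shown above (the special tokens and field labels are an internal format, reproduced here only for illustration):

```python
def build_posting_prompt(vendor: str, ico: str, item: str, context: str) -> str:
    """Assemble the <|context|>/<|target|> prompt used for
    SK-Posting-Classifier training and inference, following the
    example format above. The model completes the JSON after
    <|target|>."""
    return (
        "<|context|>\n"
        f"Vendor: {vendor}, IČO {ico}\n"
        f'Item: "{item}"\n'
        f"Accounting context: {context}\n"
        "<|target|>\n"
    )
```

Including the accounting context line is what allows the same item text to map to different accounts for different entity types.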
In evaluation the model achieves:
- Top-1 accuracy: 88.7%
- Top-3 accuracy: 96.1%
- Macro-F1 over 87 most-used accounts: 0.887
The GastroPlay UI shows the top-3 suggestions to the accountant, who can confirm or override. Each manual correction is logged and applied weekly as a training signal for a per-tenant fine-tune (a firm-specific LoRA adapter); typically 50 – 200 examples suffice to improve on the global model by 4 – 8 percentage points.
3.5 Layer 4 – SK-Legal-RAG
The last component handles the assistant task: an accountant asks in natural language “What’s my deadline for filing the VAT Control Statement for February if I’m a monthly payer?” and the system returns an answer with proper paragraph citations.
The architecture is classic RAG [17]:
- Corpus — Slovak laws relevant to accounting (Act no. 431/2002 Coll., 222/2004, 595/2003, 311/2001, 461/2003, 580/2004 + 14 others), MF SR decrees and FR SR methodological guidelines. Total 24,600 paragraphs / 1.8M tokens, updated weekly via web scraping of Slov-Lex.sk.
- Chunking — paragraph-level with paragraph overlaps for coherence.
- Embeddings — bge-m3 (multilingual), 1,024-dim vectors stored in Qdrant.
- Retrieval — dense + sparse hybrid (Qdrant native + BM25 rerank top-50).
- Generation — Llama 3.1 8B Instruct with system prompt “Answer exclusively from the provided citations. Cite § and act for every statement. If context is insufficient, say so.”
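The production stack relies on Qdrant's hybrid search with a BM25 rerank; as a generic illustration of how dense and sparse rankings can be combined, here is a reciprocal-rank-fusion sketch (the RRF constant k=60 is a common default from the literature, not our tuned value):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of paragraph ids into one ranking.
    Each document scores sum(1 / (k + rank)) over the lists it
    appears in, so items ranked highly by both the dense and the
    sparse retriever float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A paragraph retrieved by only one channel (e.g. a rare statutory term matched lexically but missed by the embedding) still survives into the fused list, which is the main benefit of hybrid retrieval on legal text.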
Retrieval evaluation on 240 manually qualified queries:
| Metric | Value |
|---|---|
| Recall@5 | 0.924 |
| Recall@10 | 0.961 |
| MRR | 0.853 |
| Hallucination (manual review on 50 samples) | 4.0% |
A 4% hallucination rate is higher than we would like; Layer 4 is therefore positioned as assistance, not binding advice, which is explicitly communicated in the UI.
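The Recall@k and MRR figures are computed in the standard way over the 240 qualified queries; a minimal sketch (query and paragraph ids below are hypothetical):

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries whose gold paragraph appears in the top-k
    retrieved list. `results` maps query id -> ranked paragraph ids,
    `relevant` maps query id -> the gold paragraph id."""
    hits = sum(1 for q, docs in results.items() if relevant[q] in docs[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the gold paragraph (0 when not retrieved
    at all)."""
    total = 0.0
    for q, docs in results.items():
        if relevant[q] in docs:
            total += 1.0 / (docs.index(relevant[q]) + 1)
    return total / len(results)
```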
4. Training process and data curation
4.1 Data
Training data came from a combination of three sources:
- Synthetic data generated from 23 original templates (Llama 3.1 70B prompting + verification) — 1,400 examples.
- Real anonymised invoices from participating GastroPlay clients with their consent (DPA signed; PII swapped for tokens such as [VENDOR_NAME_47], [IBAN_12]) — 1,800 examples.
- Publicly available invoices from financial statements published on registeruz.sk, where some invoices are attached to audit reports — 1,000 examples.
Annotation was two-pass — primary annotator (trained accountant) → control annotator → conflicts resolved by senior auditor. Inter-annotator agreement (Cohen’s κ) was 0.89 for field extraction and 0.76 for posting decisions.
4.2 Fine-tuning configuration
For SK-Invoice-Extract (Llama 3.1 8B):
base_model: meta-llama/Llama-3.1-8B-Instruct
peft:
  method: lora
  rank: 32
  alpha: 64
  dropout: 0.05
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
training:
  optimizer: adamw_torch
  lr: 2e-4
  warmup_ratio: 0.03
  scheduler: cosine
  batch_size_per_device: 4
  gradient_accumulation: 2
  effective_batch: 32
  epochs: 3
  precision: bf16
  packing: true
  max_seq_len: 4096
Training on 4× H100 took ~12 hours; the resulting LoRA adapter is 168 MB. For production inference we merge the LoRA weights into the base model and apply 4-bit quantisation via bitsandbytes to save VRAM (the merged, quantised model is ~5.1 GB and fits comfortably on an L40S 48 GB with batch headroom).
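The 168 MB adapter size can be sanity-checked from the publicly documented Llama 3.1 8B projection shapes (hidden size 4096, GQA key/value dim 1024, MLP dim 14336, 32 layers): LoRA adds r·(d_in + d_out) parameters per targeted matrix (the A and B factors together), stored here in bf16:

```python
# Llama 3.1 8B projection shapes as (d_in, d_out); grouped-query
# attention shrinks the k/v output dimension to 1024.
SHAPES = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}

def lora_adapter_bytes(rank=32, layers=32, dtype_bytes=2):
    """Total adapter size: rank * (d_in + d_out) parameters per
    targeted matrix, summed over modules and layers, times the
    bytes per parameter (2 for bf16)."""
    per_layer = sum(rank * (din + dout) for din, dout in SHAPES.values())
    return per_layer * layers * dtype_bytes
```

This comes out to roughly 84M trainable parameters, about 1% of the base model, which is what makes weekly per-tenant retrains (§4.3) economical.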
For SK-Posting-Classifier (Mistral 7B v0.3) we used QLoRA with NF4 quantisation directly during training, reducing VRAM needs to 1× H100. Training took 8 hours.
4.3 Per-tenant adaptation
Each new tenant (GastroPlay customer) typically generates 80 – 250 manual posting corrections in the first 30 days. These corrections are logged to a feedback DB and every Sunday an automated per-tenant fine-tune runs — a new LoRA adapter is created on top of the global SK-Posting-Classifier model, validated on a hold-out 10% of tenant data, and if the new validation beats the previous one by ≥ 1.5 pp the adapter is deployed to the production router. The strategy is inspired by MoLE [18] (mixture of LoRA experts).
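The Sunday promotion gate reduces to a single comparison on the tenant hold-out set; a sketch (the function name and signature are ours, not the production API):

```python
def should_promote(new_f1: float, current_f1: float,
                   min_gain_pp: float = 1.5) -> bool:
    """Deploy the freshly trained per-tenant LoRA adapter only if it
    beats the currently deployed adapter on the tenant's 10%
    hold-out set by at least `min_gain_pp` percentage points."""
    return (new_f1 - current_f1) * 100 >= min_gain_pp
```

The ≥ 1.5 pp margin keeps noisy weekly fluctuations from churning the production router with adapters that are not genuinely better.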
Across 14 reference tenants we observed average improvement after 90 days from 0.887 global F1 to 0.931 per-tenant F1.
5. Results
5.1 Comparison with baseline cloud LLMs
Table 1 compares our local stack with three baseline cloud APIs.
| System | F1 extract | F1 post | Lat. P50 (s) | Lat. P95 (s) | €/1,000 docs |
|---|---|---|---|---|---|
| GPT-4o (zero-shot) | 0.892 | 0.718 | 7.8 | 14.2 | €18.40 |
| Claude Sonnet 4.6 (zero-shot) | 0.917 | 0.747 | 6.1 | 11.8 | €15.80 |
| Gemini 2.5 Pro (few-shot) | 0.884 | 0.694 | 9.4 | 16.7 | €12.30 |
| IOAS local stack | 0.961 | 0.887 | 3.8 | 7.9 | €2.10 |
The €2.10/1,000 docs cost includes amortisation of GPU infrastructure (assumed 35% utilisation), electricity, network and backups. For a tenant on the Gastro Pro plan with 1,000 docs/month this means a marginal cost of €2.10 against €89 ARPU.
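The €/1,000-docs figure follows from straightforward amortisation arithmetic. The sketch below uses illustrative inputs (GPU hourly cost, throughput, overhead multiplier), not the article's actual cost model; only the 35% utilisation figure comes from the text:

```python
def cost_per_1000_docs(gpu_hour_eur: float, docs_per_gpu_hour: float,
                       utilisation: float = 0.35,
                       overhead: float = 1.3) -> float:
    """Back-of-envelope marginal cost: amortised GPU cost per hour,
    scaled by an overhead multiplier for electricity, network and
    backups, divided by the documents actually processed per hour at
    the assumed utilisation."""
    effective_docs = docs_per_gpu_hour * utilisation
    return gpu_hour_eur * overhead / effective_docs * 1000
```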
5.2 GDPR and regulatory advantages
The most material advantages are not measurable in F1: no client data leaves the EU, eliminating:
- transfer assessments under GDPR Chapter V,
- sub-processor contracts with non-EEA vendors,
- reclassification risks under the Schrems II [19] doctrine for US providers,
- tax-secrecy issues (§ 11 of Act no. 563/2009 Coll.),
- exposure to AI Act obligations for “high-risk” systems in accounting [20].
5.3 Ablation studies
Table 2 shows the contribution of individual SK-Invoice-Extract components.
| Configuration | F1 |
|---|---|
| Llama 3.1 8B base (zero-shot) | 0.743 |
| + 5-shot prompting | 0.801 |
| + LayoutLMv3 layout | 0.832 |
| + LoRA fine-tune (full data) | 0.953 |
| + IČO mod-11 post-process | 0.958 |
| + VAT consistency check | 0.961 |
The largest contribution comes from LoRA fine-tune (+12.1 pp), which teaches the model to recognise Slovak idiosyncrasies (abbreviated company names, DD.MM.YYYY date formats, specific positions of VAT sums in the invoice corner).
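The final ablation row, the VAT consistency check, verifies that each VAT amount matches its base at the statutory rate and that the parts sum to the grand total; a sketch over the totals object from the §3.3 schema (the tolerance value is illustrative, chosen to absorb per-line rounding):

```python
from decimal import Decimal

def vat_consistent(totals: dict, tol: Decimal = Decimal("0.02")) -> bool:
    """Check that vat_R ~= base_R * R/100 for each Slovak rate
    (23/19/5) and that bases + VAT amounts + exempt sum to total,
    within a small rounding tolerance."""
    ok = True
    for rate in (23, 19, 5):
        base = Decimal(totals[f"base_{rate}"])
        vat = Decimal(totals[f"vat_{rate}"])
        ok &= abs(base * rate / 100 - vat) <= tol
    parts = sum(Decimal(totals[key]) for key in totals if key != "total")
    return ok and abs(parts - Decimal(totals["total"])) <= tol
```

Like the IČO checksum, this is a deterministic filter: it cannot fix an extraction, but it reliably flags invoices where a digit was misread, which is where the last +0.3 pp in the ablation comes from.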
6. Implementation in the GastroPlay.sk product
GastroPlay.sk is a cloud-native operating system for the Slovak gastronomy segment, developed by INNO, s. r. o. (IČO 46 490 230, Trenčín). IOAS provides the local-model infrastructure (layers 1 – 4 of Fig. 1) as Inference-as-a-Service with 99.95% SLA, end-to-end encryption (TLS 1.3 in transit, AES-256 at-rest with key rotation in HashiCorp Vault) and the EU-Central region (Frankfurt).
In the GastroPlay mobile app, the typical user flow is:
- The restaurant manager photographs a vendor invoice.
- The app uploads the image via a signed S3 URL to the IOAS extractor service.
- Layers 1 – 2 (OCR + extraction) deliver JSON in ~3.8 s P50.
- Layer 3 (posting) suggests an account and cost centre.
- The manager or accountant approves in the app via biometric authentication or PIN.
- The approved document is posted to the double-entry ledger (e.g. accounts 321, 343, 501).
- On a user query like “When do I have to file VAT for March?” Layer 4 (Legal-RAG) is called and returns an answer citing § 78 (1) of Act no. 222/2004 Coll.
In production deployment (April 2026) the models process ~22,000 documents and ~1,100 Legal-RAG queries daily from 14 pilot tenants.
7. Discussion and limitations
7.1 Trade-offs vs. cloud LLMs
Our results do not claim that a local specialised model is universally superior to cloud LLMs. Frontier cloud models (GPT-4o, Claude Sonnet 4.6) remain better on tasks outside our specialised models' training distribution — e.g. highly non-standard invoices from small foreign vendors without Slovak context. Our architecture therefore includes a cloud fallback for cases where the local model's confidence drops below 0.72; in 2025 – 2026 this occurred on ~3.4% of processed documents. Such cloud calls pass through a strict PII-redaction layer (birth numbers, IBANs and full names replaced by [TOKEN_X]) and logs are anonymised before further analysis.
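The redaction step can be sketched with two illustrative regex patterns; the production layer covers many more PII categories and uses stable per-document token numbering, so this is only a minimal demonstration of the approach:

```python
import re

# Illustrative patterns only: Slovak birth numbers (rodné čísla)
# look like YYMMDD/NNN(N), Slovak IBANs like SK + 22 digits.
PATTERNS = {
    "BIRTH_ID": re.compile(r"\b\d{6}/\d{3,4}\b"),
    "IBAN": re.compile(r"\bSK\d{22}\b"),
}

def redact(text: str) -> str:
    """Replace PII spans with [LABEL_N] placeholders before any
    cloud fallback call, so raw identifiers never leave the
    infrastructure."""
    counter = 0
    for label, pattern in PATTERNS.items():
        def repl(_match, label=label):
            nonlocal counter
            counter += 1
            return f"[{label}_{counter}]"
        text = pattern.sub(repl, text)
    return text
```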
7.2 Instability under legal change
The largest operational risk is rapid legislative change. The Slovak consolidation package 2026 (Act no. 384/2025 Coll. on eKasa, VAT amendment, changes to corporate income-tax rates) required two re-trainings of SK-Posting-Classifier and six Legal-RAG corpus updates within Q4 2025. To minimise downtime we maintain blue-green model deployments and A/B test each new version on 5% traffic for 48 hours before full rollout.
7.3 Per-tenant data leakage
Per-tenant LoRA adapters improve accuracy but also open a theoretical attack vector (membership inference) [21]. Adapters are isolated in per-tenant Kubernetes namespaces with per-tenant KMS keys; cross-tenant access is excluded at the infrastructure layer. An external firm performs regular (monthly) penetration testing.
7.4 Dependence on open models
Llama 3.1 and Mistral 7B are open-weight but subject to the Llama Community License [22] and the Apache 2.0 licence, respectively. The risk is a change in licensing policy or a retroactive adjustment by the vendor. As mitigation we maintain a fallback architecture with a fully open-source base model (Phi-3 medium, MIT licence) for rapid failover.
8. Conclusion
We presented an architecture of local language models for accounting and tax compliance automation for the Slovak market, designed and deployed at IOAS for the GastroPlay.sk product. Three specialised models — SK-Invoice-Extract, SK-Posting-Classifier and SK-Legal-RAG — achieve better F1 metrics than cloud frontier LLMs on Slovak domain tasks, at 7× lower marginal cost and full GDPR/AI-Act data sovereignty. Per-tenant fine-tuning brings further 4 – 8 pp F1 gains within 90 days of client onboarding.
Further research directions IOAS is working on include: (i) multimodal fine-tuning with direct image input via a LLaVA 1.6 fork [23] on Slovak invoices, (ii) streaming OCR for real-time scanning of paper-document strips during ÚVZ inspections, and (iii) federated learning for cross-tenant model improvement without raw-data exfiltration.
Acknowledgements
We thank the INNO, s. r. o. team for open collaboration, access to anonymised reference data and the 14 GastroPlay.sk pilot clients for their patience during iterative model tuning.
References
[1] Statistical Office of the SR, Active business entities in sector I — Accommodation and food services, 2025 data (published 2026-02).
[2] Wang, Y. et al. Document Understanding with Large Language Models: A Comprehensive Benchmark. NeurIPS 2024 Datasets & Benchmarks Track.
[3] Abdin, M. et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 (2024).
[4] Jiang, A. Q. et al. Mistral 7B. arXiv:2310.06825 (2023).
[5] Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv:1908.10063 (2019).
[6] Yang, H. et al. FinGPT: Open-Source Financial Large Language Models. arXiv:2306.06031 (2023).
[7] Zhang, X. et al. DocFin: A Domain-Specific Language Model for Financial Document Understanding. ACL Findings 2024.
[8] Pikuliak, M. et al. SlovakBERT: Slovak Masked Language Model. EMNLP Findings 2022.
[9] Hládek, D. et al. Slovak RoBERTa: Pretrained Transformer for Slovak Language. SLOVKO 2023 Conference Proceedings.
[10] Meta AI, The Llama 3 Herd of Models. arXiv:2407.21783 (2024).
[11] Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
[12] Dettmers, T. et al. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023.
[13] Huang, Y. et al. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. ACM MM 2022.
[14] Kim, G. et al. Donut: Document Understanding Transformer without OCR. ECCV 2022.
[15] Chen, J. et al. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. ACL 2024.
[16] CEN/TC 434, EN 16931-1:2017 — Electronic invoicing: Semantic data model of the core elements of an electronic invoice. European Committee for Standardization (2017).
[17] Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
[18] Wu, X. et al. Mixture of LoRA Experts. ICLR 2024.
[19] Court of Justice of the EU, Schrems II — Data Protection Commissioner v. Facebook Ireland and Maximillian Schrems, C-311/18 (2020).
[20] European Parliament and Council, Regulation (EU) 2024/1689 on artificial intelligence (AI Act), Art. 6 and Annex III. OJ EU L 1689/2024.
[21] Shokri, R. et al. Membership Inference Attacks Against Machine Learning Models. IEEE S&P 2017.
[22] Meta AI, Llama 3.1 Community License Agreement, available at https://llama.meta.com/llama3/license/ (revision July 2024).
[23] Liu, H. et al. Improved Baselines with Visual Instruction Tuning. CVPR 2024.
About the author
IOAS is a Slovak engineering team based in Trenčín focused on developing specialised AI systems for regulated industries in the Central European context. Our platform combines fine-tuned open-weight models, edge inference and end-to-end GDPR-compliant infrastructure. We work with product developers where data protection and domain accuracy are essential.
Contact: research@ioas.pro · ioas.pro
This article is published under Creative Commons BY-NC 4.0. When citing, please use: IOAS Team, “Local language models in accounting automation: a GastroPlay.sk case study”, ioas.pro/research, April 2026.
Disclaimer: Some numerical figures (e.g., exact F1 metrics, reference dates, number of pilot tenants) are illustrative for the purpose of presenting the architecture and methodology. For exact current metrics, contact research@ioas.pro.