[{"data":1,"prerenderedAt":2112},["ShallowReactive",2],{"/en-us/blog/tags/performance/":3,"navigation-en-us":19,"banner-en-us":437,"footer-en-us":452,"performance-tag-page-en-us":663},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"content":8,"config":10,"_id":12,"_type":13,"title":14,"_source":15,"_file":16,"_stem":17,"_extension":18},"/en-us/blog/tags/performance","tags",false,"",{"tag":9,"tagSlug":9},"performance",{"template":11},"BlogTag","content:en-us:blog:tags:performance.yml","yaml","Performance","content","en-us/blog/tags/performance.yml","en-us/blog/tags/performance","yml",{"_path":20,"_dir":21,"_draft":6,"_partial":6,"_locale":7,"data":22,"_id":433,"_type":13,"title":434,"_source":15,"_file":435,"_stem":436,"_extension":18},"/shared/en-us/main-navigation","en-us",{"logo":23,"freeTrial":28,"sales":33,"login":38,"items":43,"search":374,"minimal":405,"duo":424},{"config":24},{"href":25,"dataGaName":26,"dataGaLocation":27},"/","gitlab logo","header",{"text":29,"config":30},"Get free trial",{"href":31,"dataGaName":32,"dataGaLocation":27},"https://gitlab.com/-/trial_registrations/new?glm_source=about.gitlab.com&glm_content=default-saas-trial/","free trial",{"text":34,"config":35},"Talk to sales",{"href":36,"dataGaName":37,"dataGaLocation":27},"/sales/","sales",{"text":39,"config":40},"Sign in",{"href":41,"dataGaName":42,"dataGaLocation":27},"https://gitlab.com/users/sign_in/","sign in",[44,88,184,189,295,355],{"text":45,"config":46,"cards":48,"footer":71},"Platform",{"dataNavLevelOne":47},"platform",[49,55,63],{"title":45,"description":50,"link":51},"The most comprehensive AI-powered DevSecOps Platform",{"text":52,"config":53},"Explore our Platform",{"href":54,"dataGaName":47,"dataGaLocation":27},"/platform/",{"title":56,"description":57,"link":58},"GitLab Duo (AI)","Build software faster with AI at every stage of development",{"text":59,"config":60},"Meet GitLab Duo",{"href":61,"dataGaName":62,"dataGaLocation":27},"/gitlab-duo/","gitlab duo ai",{"title":64,"description":65,"link":66},"Why GitLab","10 reasons why Enterprises choose GitLab",{"text":67,"config":68},"Learn more",{"href":69,"dataGaName":70,"dataGaLocation":27},"/why-gitlab/","why gitlab",{"title":72,"items":73},"Get started with",[74,79,84],{"text":75,"config":76},"Platform Engineering",{"href":77,"dataGaName":78,"dataGaLocation":27},"/solutions/platform-engineering/","platform engineering",{"text":80,"config":81},"Developer Experience",{"href":82,"dataGaName":83,"dataGaLocation":27},"/developer-experience/","Developer experience",{"text":85,"config":86},"MLOps",{"href":87,"dataGaName":85,"dataGaLocation":27},"/topics/devops/the-role-of-ai-in-devops/",{"text":89,"left":90,"config":91,"link":93,"lists":97,"footer":166},"Product",true,{"dataNavLevelOne":92},"solutions",{"text":94,"config":95},"View all Solutions",{"href":96,"dataGaName":92,"dataGaLocation":27},"/solutions/",[98,123,145],{"title":99,"description":100,"link":101,"items":106},"Automation","CI/CD and automation to accelerate deployment",{"config":102},{"icon":103,"href":104,"dataGaName":105,"dataGaLocation":27},"AutomatedCodeAlt","/solutions/delivery-automation/","automated software delivery",[107,111,115,119],{"text":108,"config":109},"CI/CD",{"href":110,"dataGaLocation":27,"dataGaName":108},"/solutions/continuous-integration/",{"text":112,"config":113},"AI-Assisted Development",{"href":61,"dataGaLocation":27,"dataGaName":114},"AI assisted development",{"text":116,"config":117},"Source Code 
Management",{"href":118,"dataGaLocation":27,"dataGaName":116},"/solutions/source-code-management/",{"text":120,"config":121},"Automated Software Delivery",{"href":104,"dataGaLocation":27,"dataGaName":122},"Automated software delivery",{"title":124,"description":125,"link":126,"items":131},"Security","Deliver code faster without compromising security",{"config":127},{"href":128,"dataGaName":129,"dataGaLocation":27,"icon":130},"/solutions/security-compliance/","security and compliance","ShieldCheckLight",[132,135,140],{"text":133,"config":134},"Security & Compliance",{"href":128,"dataGaLocation":27,"dataGaName":133},{"text":136,"config":137},"Software Supply Chain Security",{"href":138,"dataGaLocation":27,"dataGaName":139},"/solutions/supply-chain/","Software supply chain security",{"text":141,"config":142},"Compliance & Governance",{"href":143,"dataGaLocation":27,"dataGaName":144},"/solutions/continuous-software-compliance/","Compliance and governance",{"title":146,"link":147,"items":152},"Measurement",{"config":148},{"icon":149,"href":150,"dataGaName":151,"dataGaLocation":27},"DigitalTransformation","/solutions/visibility-measurement/","visibility and measurement",[153,157,161],{"text":154,"config":155},"Visibility & Measurement",{"href":150,"dataGaLocation":27,"dataGaName":156},"Visibility and Measurement",{"text":158,"config":159},"Value Stream Management",{"href":160,"dataGaLocation":27,"dataGaName":158},"/solutions/value-stream-management/",{"text":162,"config":163},"Analytics & Insights",{"href":164,"dataGaLocation":27,"dataGaName":165},"/solutions/analytics-and-insights/","Analytics and insights",{"title":167,"items":168},"GitLab for",[169,174,179],{"text":170,"config":171},"Enterprise",{"href":172,"dataGaLocation":27,"dataGaName":173},"/enterprise/","enterprise",{"text":175,"config":176},"Small Business",{"href":177,"dataGaLocation":27,"dataGaName":178},"/small-business/","small business",{"text":180,"config":181},"Public Sector",{"href":182,"dataGaLocation":27,"dataGaName":183},"/solutions/public-sector/","public sector",{"text":185,"config":186},"Pricing",{"href":187,"dataGaName":188,"dataGaLocation":27,"dataNavLevelOne":188},"/pricing/","pricing",{"text":190,"config":191,"link":193,"lists":197,"feature":282},"Resources",{"dataNavLevelOne":192},"resources",{"text":194,"config":195},"View all resources",{"href":196,"dataGaName":192,"dataGaLocation":27},"/resources/",[198,231,254],{"title":199,"items":200},"Getting started",[201,206,211,216,221,226],{"text":202,"config":203},"Install",{"href":204,"dataGaName":205,"dataGaLocation":27},"/install/","install",{"text":207,"config":208},"Quick start guides",{"href":209,"dataGaName":210,"dataGaLocation":27},"/get-started/","quick setup checklists",{"text":212,"config":213},"Learn",{"href":214,"dataGaLocation":27,"dataGaName":215},"https://university.gitlab.com/","learn",{"text":217,"config":218},"Product documentation",{"href":219,"dataGaName":220,"dataGaLocation":27},"https://docs.gitlab.com/","product documentation",{"text":222,"config":223},"Best practice videos",{"href":224,"dataGaName":225,"dataGaLocation":27},"/getting-started-videos/","best practice videos",{"text":227,"config":228},"Integrations",{"href":229,"dataGaName":230,"dataGaLocation":27},"/integrations/","integrations",{"title":232,"items":233},"Discover",[234,239,244,249],{"text":235,"config":236},"Customer success stories",{"href":237,"dataGaName":238,"dataGaLocation":27},"/customers/","customer success 
stories",{"text":240,"config":241},"Blog",{"href":242,"dataGaName":243,"dataGaLocation":27},"/blog/","blog",{"text":245,"config":246},"Remote",{"href":247,"dataGaName":248,"dataGaLocation":27},"https://handbook.gitlab.com/handbook/company/culture/all-remote/","remote",{"text":250,"config":251},"TeamOps",{"href":252,"dataGaName":253,"dataGaLocation":27},"/teamops/","teamops",{"title":255,"items":256},"Connect",[257,262,267,272,277],{"text":258,"config":259},"GitLab Services",{"href":260,"dataGaName":261,"dataGaLocation":27},"/services/","services",{"text":263,"config":264},"Community",{"href":265,"dataGaName":266,"dataGaLocation":27},"/community/","community",{"text":268,"config":269},"Forum",{"href":270,"dataGaName":271,"dataGaLocation":27},"https://forum.gitlab.com/","forum",{"text":273,"config":274},"Events",{"href":275,"dataGaName":276,"dataGaLocation":27},"/events/","events",{"text":278,"config":279},"Partners",{"href":280,"dataGaName":281,"dataGaLocation":27},"/partners/","partners",{"backgroundColor":283,"textColor":284,"text":285,"image":286,"link":290},"#2f2a6b","#fff","Insights for the future of software development",{"altText":287,"config":288},"the source promo card",{"src":289},"/images/navigation/the-source-promo-card.svg",{"text":291,"config":292},"Read the latest",{"href":293,"dataGaName":294,"dataGaLocation":27},"/the-source/","the source",{"text":296,"config":297,"lists":299},"Company",{"dataNavLevelOne":298},"company",[300],{"items":301},[302,307,313,315,320,325,330,335,340,345,350],{"text":303,"config":304},"About",{"href":305,"dataGaName":306,"dataGaLocation":27},"/company/","about",{"text":308,"config":309,"footerGa":312},"Jobs",{"href":310,"dataGaName":311,"dataGaLocation":27},"/jobs/","jobs",{"dataGaName":311},{"text":273,"config":314},{"href":275,"dataGaName":276,"dataGaLocation":27},{"text":316,"config":317},"Leadership",{"href":318,"dataGaName":319,"dataGaLocation":27},"/company/team/e-group/","leadership",{"text":321,"config":322},"Team",{"href":323,"dataGaName":324,"dataGaLocation":27},"/company/team/","team",{"text":326,"config":327},"Handbook",{"href":328,"dataGaName":329,"dataGaLocation":27},"https://handbook.gitlab.com/","handbook",{"text":331,"config":332},"Investor relations",{"href":333,"dataGaName":334,"dataGaLocation":27},"https://ir.gitlab.com/","investor relations",{"text":336,"config":337},"Trust Center",{"href":338,"dataGaName":339,"dataGaLocation":27},"/security/","trust center",{"text":341,"config":342},"AI Transparency Center",{"href":343,"dataGaName":344,"dataGaLocation":27},"/ai-transparency-center/","ai transparency center",{"text":346,"config":347},"Newsletter",{"href":348,"dataGaName":349,"dataGaLocation":27},"/company/contact/","newsletter",{"text":351,"config":352},"Press",{"href":353,"dataGaName":354,"dataGaLocation":27},"/press/","press",{"text":356,"config":357,"lists":358},"Contact us",{"dataNavLevelOne":298},[359],{"items":360},[361,364,369],{"text":34,"config":362},{"href":36,"dataGaName":363,"dataGaLocation":27},"talk to sales",{"text":365,"config":366},"Get help",{"href":367,"dataGaName":368,"dataGaLocation":27},"/support/","get help",{"text":370,"config":371},"Customer portal",{"href":372,"dataGaName":373,"dataGaLocation":27},"https://customers.gitlab.com/customers/sign_in/","customer portal",{"close":375,"login":376,"suggestions":383},"Close",{"text":377,"link":378},"To search repositories and projects, login to",{"text":379,"config":380},"gitlab.com",{"href":41,"dataGaName":381,"dataGaLocation":382},"search 
login","search",{"text":384,"default":385},"Suggestions",[386,388,392,394,398,402],{"text":56,"config":387},{"href":61,"dataGaName":56,"dataGaLocation":382},{"text":389,"config":390},"Code Suggestions (AI)",{"href":391,"dataGaName":389,"dataGaLocation":382},"/solutions/code-suggestions/",{"text":108,"config":393},{"href":110,"dataGaName":108,"dataGaLocation":382},{"text":395,"config":396},"GitLab on AWS",{"href":397,"dataGaName":395,"dataGaLocation":382},"/partners/technology-partners/aws/",{"text":399,"config":400},"GitLab on Google Cloud",{"href":401,"dataGaName":399,"dataGaLocation":382},"/partners/technology-partners/google-cloud-platform/",{"text":403,"config":404},"Why GitLab?",{"href":69,"dataGaName":403,"dataGaLocation":382},{"freeTrial":406,"mobileIcon":411,"desktopIcon":416,"secondaryButton":419},{"text":407,"config":408},"Start free trial",{"href":409,"dataGaName":32,"dataGaLocation":410},"https://gitlab.com/-/trials/new/","nav",{"altText":412,"config":413},"Gitlab Icon",{"src":414,"dataGaName":415,"dataGaLocation":410},"/images/brand/gitlab-logo-tanuki.svg","gitlab icon",{"altText":412,"config":417},{"src":418,"dataGaName":415,"dataGaLocation":410},"/images/brand/gitlab-logo-type.svg",{"text":420,"config":421},"Get Started",{"href":422,"dataGaName":423,"dataGaLocation":410},"https://gitlab.com/-/trial_registrations/new?glm_source=about.gitlab.com/compare/gitlab-vs-github/","get started",{"freeTrial":425,"mobileIcon":429,"desktopIcon":431},{"text":426,"config":427},"Learn more about GitLab Duo",{"href":61,"dataGaName":428,"dataGaLocation":410},"gitlab duo",{"altText":412,"config":430},{"src":414,"dataGaName":415,"dataGaLocation":410},{"altText":412,"config":432},{"src":418,"dataGaName":415,"dataGaLocation":410},"content:shared:en-us:main-navigation.yml","Main Navigation","shared/en-us/main-navigation.yml","shared/en-us/main-navigation",{"_path":438,"_dir":21,"_draft":6,"_partial":6,"_locale":7,"title":439,"button":440,"image":444,"config":447,"_id":449,"_type":13,"_source":15,"_file":450,"_stem":451,"_extension":18},"/shared/en-us/banner","is now in public beta!",{"text":67,"config":441},{"href":442,"dataGaName":443,"dataGaLocation":27},"/gitlab-duo/agent-platform/","duo banner",{"config":445},{"src":446},"https://res.cloudinary.com/about-gitlab-com/image/upload/v1753720689/somrf9zaunk0xlt7ne4x.svg",{"layout":448},"release","content:shared:en-us:banner.yml","shared/en-us/banner.yml","shared/en-us/banner",{"_path":453,"_dir":21,"_draft":6,"_partial":6,"_locale":7,"data":454,"_id":659,"_type":13,"title":660,"_source":15,"_file":661,"_stem":662,"_extension":18},"/shared/en-us/main-footer",{"text":455,"source":456,"edit":462,"contribute":467,"config":472,"items":477,"minimal":651},"Git is a trademark of Software Freedom Conservancy and our use of 'GitLab' is under license",{"text":457,"config":458},"View page source",{"href":459,"dataGaName":460,"dataGaLocation":461},"https://gitlab.com/gitlab-com/marketing/digital-experience/about-gitlab-com/","page source","footer",{"text":463,"config":464},"Edit this page",{"href":465,"dataGaName":466,"dataGaLocation":461},"https://gitlab.com/gitlab-com/marketing/digital-experience/about-gitlab-com/-/blob/main/content/","web ide",{"text":468,"config":469},"Please contribute",{"href":470,"dataGaName":471,"dataGaLocation":461},"https://gitlab.com/gitlab-com/marketing/digital-experience/about-gitlab-com/-/blob/main/CONTRIBUTING.md/","please 
contribute",{"twitter":473,"facebook":474,"youtube":475,"linkedin":476},"https://twitter.com/gitlab","https://www.facebook.com/gitlab","https://www.youtube.com/channel/UCnMGQ8QHMAnVIsI3xJrihhg","https://www.linkedin.com/company/gitlab-com",[478,501,558,587,621],{"title":45,"links":479,"subMenu":484},[480],{"text":481,"config":482},"DevSecOps platform",{"href":54,"dataGaName":483,"dataGaLocation":461},"devsecops platform",[485],{"title":185,"links":486},[487,491,496],{"text":488,"config":489},"View plans",{"href":187,"dataGaName":490,"dataGaLocation":461},"view plans",{"text":492,"config":493},"Why Premium?",{"href":494,"dataGaName":495,"dataGaLocation":461},"/pricing/premium/","why premium",{"text":497,"config":498},"Why Ultimate?",{"href":499,"dataGaName":500,"dataGaLocation":461},"/pricing/ultimate/","why ultimate",{"title":502,"links":503},"Solutions",[504,509,512,514,519,524,528,531,535,540,542,545,548,553],{"text":505,"config":506},"Digital transformation",{"href":507,"dataGaName":508,"dataGaLocation":461},"/topics/digital-transformation/","digital transformation",{"text":133,"config":510},{"href":128,"dataGaName":511,"dataGaLocation":461},"security & compliance",{"text":122,"config":513},{"href":104,"dataGaName":105,"dataGaLocation":461},{"text":515,"config":516},"Agile development",{"href":517,"dataGaName":518,"dataGaLocation":461},"/solutions/agile-delivery/","agile delivery",{"text":520,"config":521},"Cloud transformation",{"href":522,"dataGaName":523,"dataGaLocation":461},"/topics/cloud-native/","cloud transformation",{"text":525,"config":526},"SCM",{"href":118,"dataGaName":527,"dataGaLocation":461},"source code management",{"text":108,"config":529},{"href":110,"dataGaName":530,"dataGaLocation":461},"continuous integration & delivery",{"text":532,"config":533},"Value stream management",{"href":160,"dataGaName":534,"dataGaLocation":461},"value stream management",{"text":536,"config":537},"GitOps",{"href":538,"dataGaName":539,"dataGaLocation":461},"/solutions/gitops/","gitops",{"text":170,"config":541},{"href":172,"dataGaName":173,"dataGaLocation":461},{"text":543,"config":544},"Small business",{"href":177,"dataGaName":178,"dataGaLocation":461},{"text":546,"config":547},"Public sector",{"href":182,"dataGaName":183,"dataGaLocation":461},{"text":549,"config":550},"Education",{"href":551,"dataGaName":552,"dataGaLocation":461},"/solutions/education/","education",{"text":554,"config":555},"Financial services",{"href":556,"dataGaName":557,"dataGaLocation":461},"/solutions/finance/","financial 
services",{"title":190,"links":559},[560,562,564,566,569,571,573,575,577,579,581,583,585],{"text":202,"config":561},{"href":204,"dataGaName":205,"dataGaLocation":461},{"text":207,"config":563},{"href":209,"dataGaName":210,"dataGaLocation":461},{"text":212,"config":565},{"href":214,"dataGaName":215,"dataGaLocation":461},{"text":217,"config":567},{"href":219,"dataGaName":568,"dataGaLocation":461},"docs",{"text":240,"config":570},{"href":242,"dataGaName":243,"dataGaLocation":461},{"text":235,"config":572},{"href":237,"dataGaName":238,"dataGaLocation":461},{"text":245,"config":574},{"href":247,"dataGaName":248,"dataGaLocation":461},{"text":258,"config":576},{"href":260,"dataGaName":261,"dataGaLocation":461},{"text":250,"config":578},{"href":252,"dataGaName":253,"dataGaLocation":461},{"text":263,"config":580},{"href":265,"dataGaName":266,"dataGaLocation":461},{"text":268,"config":582},{"href":270,"dataGaName":271,"dataGaLocation":461},{"text":273,"config":584},{"href":275,"dataGaName":276,"dataGaLocation":461},{"text":278,"config":586},{"href":280,"dataGaName":281,"dataGaLocation":461},{"title":296,"links":588},[589,591,593,595,597,599,601,605,610,612,614,616],{"text":303,"config":590},{"href":305,"dataGaName":298,"dataGaLocation":461},{"text":308,"config":592},{"href":310,"dataGaName":311,"dataGaLocation":461},{"text":316,"config":594},{"href":318,"dataGaName":319,"dataGaLocation":461},{"text":321,"config":596},{"href":323,"dataGaName":324,"dataGaLocation":461},{"text":326,"config":598},{"href":328,"dataGaName":329,"dataGaLocation":461},{"text":331,"config":600},{"href":333,"dataGaName":334,"dataGaLocation":461},{"text":602,"config":603},"Sustainability",{"href":604,"dataGaName":602,"dataGaLocation":461},"/sustainability/",{"text":606,"config":607},"Diversity, inclusion and belonging (DIB)",{"href":608,"dataGaName":609,"dataGaLocation":461},"/diversity-inclusion-belonging/","Diversity, inclusion and belonging",{"text":336,"config":611},{"href":338,"dataGaName":339,"dataGaLocation":461},{"text":346,"config":613},{"href":348,"dataGaName":349,"dataGaLocation":461},{"text":351,"config":615},{"href":353,"dataGaName":354,"dataGaLocation":461},{"text":617,"config":618},"Modern Slavery Transparency Statement",{"href":619,"dataGaName":620,"dataGaLocation":461},"https://handbook.gitlab.com/handbook/legal/modern-slavery-act-transparency-statement/","modern slavery transparency statement",{"title":622,"links":623},"Contact Us",[624,627,629,631,636,641,646],{"text":625,"config":626},"Contact an expert",{"href":36,"dataGaName":37,"dataGaLocation":461},{"text":365,"config":628},{"href":367,"dataGaName":368,"dataGaLocation":461},{"text":370,"config":630},{"href":372,"dataGaName":373,"dataGaLocation":461},{"text":632,"config":633},"Status",{"href":634,"dataGaName":635,"dataGaLocation":461},"https://status.gitlab.com/","status",{"text":637,"config":638},"Terms of use",{"href":639,"dataGaName":640,"dataGaLocation":461},"/terms/","terms of use",{"text":642,"config":643},"Privacy statement",{"href":644,"dataGaName":645,"dataGaLocation":461},"/privacy/","privacy statement",{"text":647,"config":648},"Cookie preferences",{"dataGaName":649,"dataGaLocation":461,"id":650,"isOneTrustButton":90},"cookie 
preferences","ot-sdk-btn",{"items":652},[653,655,657],{"text":637,"config":654},{"href":639,"dataGaName":640,"dataGaLocation":461},{"text":642,"config":656},{"href":644,"dataGaName":645,"dataGaLocation":461},{"text":647,"config":658},{"dataGaName":649,"dataGaLocation":461,"id":650,"isOneTrustButton":90},"content:shared:en-us:main-footer.yml","Main Footer","shared/en-us/main-footer.yml","shared/en-us/main-footer",{"allPosts":664,"featuredPost":2091,"totalPagesCount":2110,"initialPosts":2111},[665,693,717,739,766,788,808,830,850,871,890,911,932,955,975,994,1013,1032,1052,1072,1091,1112,1132,1152,1173,1194,1214,1235,1256,1275,1296,1316,1335,1356,1377,1397,1417,1437,1457,1476,1495,1516,1538,1557,1577,1597,1616,1636,1656,1675,1694,1714,1734,1754,1774,1794,1813,1832,1852,1871,1892,1910,1930,1949,1970,1990,2011,2030,2049,2071],{"_path":666,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":667,"content":675,"config":686,"_id":689,"_type":13,"title":690,"_source":15,"_file":691,"_stem":692,"_extension":18},"/en-us/blog/a-story-of-runner-scaling",{"title":668,"description":669,"ogTitle":668,"ogDescription":669,"noIndex":6,"ogImage":670,"ogUrl":671,"ogSiteName":672,"ogType":673,"canonicalUrls":671,"schema":674},"An SA story about hyperscaling GitLab Runner workloads using Kubernetes","It is important to have the complete picture of scaled effects in view when designing automation.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749669897/Blog/Hero%20Images/kaleidico-26MJGnCM0Wc-unsplash.jpg","https://about.gitlab.com/blog/a-story-of-runner-scaling","https://about.gitlab.com","article","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"An SA story about hyperscaling GitLab Runner workloads using Kubernetes\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Darwin Sanoy\"},{\"@type\":\"Person\",\"name\":\"Brian Wald\"}],\n        \"datePublished\": \"2022-06-29\",\n      }",{"title":668,"description":669,"authors":676,"heroImage":670,"date":679,"body":680,"category":681,"tags":682},[677,678],"Darwin Sanoy","Brian Wald","2022-06-29","\n\nThe following *fictional story*\u003Csup>1\u003C/sup> reflects a repeating pattern that Solutions Architects at GitLab encounter frequently. In the analysis of this story we intend to demonstrate three things: (a) Why one should be thoughtful in leveraging Kubernetes for scaling, (b) How unintended consequences of an approach to automation can create a net productivity loss for an organization (reversal of ROI) and (c) How solutions architecture perspectives can help find anti-patterns - retrospectively or when applied during a development process.\n\n### A DevOps transformation story snippet\n\nGild Investment Trust went through a DevOps transformational effort to build efficiency in their development process through automation with GitLab. Dakota, the application development director, knew that their current system handled about 80 pipelines with 600 total tasks and over 30,000 CI minutes so they knew that scaled CI was needed. Since development occurred primarily during European business hours, they were interested in reducing compute costs outside of peak work hours. Cloud compute was also a target due to acquring the pay per use model combined with elastic scaling.\n\nIngrid was the infrastructure engineer for developer productivity who was tasked with building out the shared GitLab Runner fleet to meet the needs of the development teams. 
About three months into the proof-of-concept implementation, Sasha, a developer team lead, noted that many of their new job types were failing with strange error messages. The same jobs ran fine on quickly provisioned GitLab shell runners. Since the primary difference between the environments was the liberal allocation of machine resources on a shell runner, Sasha reasoned that the failures were likely due to the constrained CPU and memory resources of the Kubernetes pods.

To test this hypothesis, Ingrid decided to add a new pod definition. She found it difficult to discern which job types were failing due to CPU constraints, which due to memory constraints, and which due to a combination of both, and working out the answer could consume a lot of her time. She decided to simply define a pod that was more liberal with both CPU and memory and make it selectable by runner tagging when more resources were needed for certain CI jobs. She created a GitLab Runner pod definition with 4GB of memory and 1750 millicores of CPU to cover the failing job types. Developers could then use these larger containers when the smaller ones failed by adding the `large-container` tag to their GitLab job.
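Opting into the larger pod then takes a single tag on the job. A minimal sketch of what that looks like in `.gitlab-ci.yml`; the job name and script are hypothetical, while `large-container` is the tag from the story:

```yaml
# Hypothetical job showing tag-based selection of the bigger runner pod.
package-frontend:
  stage: build
  tags:
    - large-container   # route this job to the 4GB / 1750-millicore profile
  script:
    - npm ci
    - npm run build
```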
Sasha redid the CI testing and was delighted to find that the new resourcing made all the troubling jobs work fine. Sasha created a guide to help developers discern when mysterious error messages and failed CI jobs were probably the fault of resourcing, and how to add a runner tag to the job to expand the resources.

Some weeks later, two of the key jobs that had been fixed by the new container resourcing started intermittently failing on NPM package creation for just three pipelines across two different teams. Sasha tried to understand what the differences were and found that these particular pipelines were packaging notably large file sets: they were actually packaging test data, and the NPM format was a convenient way to provide that data during automated QA testing.

Sasha brought this information to Ingrid, and together they tested their way to the conclusion that a 6GB container with 2500 millicores would be sufficient for creating an NPM package out of the current test dataset. They also discussed whether the development team might want a dedicated test data management solution, but the team's needs were very simple, and their familiarity with NPM packaging meant that bending NPM to this purpose was more efficient than acquiring, deploying, learning, and maintaining a special system. So a new pod resourcing profile was defined, accessible with the runner tag `xlarge`.

Sasha updated the guide for finding the optimal container size through failure testing of CI jobs, but they were not happy with how large the document was getting, nor with how imprecise the process was for determining when a CI job failure was most likely due to container resource constraints. They were concerned that developers would not go through the process and would instead simply pick the largest container resourcing profile to avoid the effort of optimizing, and they shared this concern with Ingrid. In fact, Sasha noted, they were hard pressed themselves to follow their own guidelines and not simply choose the largest container for every job.

The potential for this cycle to repeat was halted several months later when Dakota, the app dev director, generated a report that showed a 2% increase in developer time spent optimizing CI jobs through failure testing for container sizing. Dakota considered this a net-new increase in work, because when the company was not using container-based CI, the developers did not have to manage this concern at all. Across 298 developers this amounted to roughly $840,000 per year<sup>2</sup>. It was also thought to add about two hours (and growing) to developer onboarding training. The report did not attempt to account for the opportunity cost tax (what would these people be doing to solve customer problems with that time?), nor for the "critical moments tax," where complexity has an outsized frustration effect and business impact in high-pressure, high-risk situations.

### Solution architecture retrospective: What went wrong?

This story reflects a classic antipattern we see at GitLab, not only in Kubernetes runner optimization, but also in other areas such as overly minimalized build containers and the resulting pipeline complexity, as discussed in a previous blog, [When the pursuit of simplicity creates complexity in container-based CI pipelines](/blog/second-law-of-complexity-dynamics/). Frequently this result comes from inadvertent adherence to heuristics from a small part of the problem as though they applied to the entirety of the problem (a type of logical "fallacy of composition").

Thankfully, the emergence of the anti-pattern follows a pattern itself. :) Let's apply a little retrospective solution architecture to what happened, in order to learn what might be done proactively to create better iterations on the next automation project.
There is a certain approach to landscaping shared greenspaces where, rather than shaming people into compliance with signs about not cutting across the grass, the paths that humans naturally take are interpreted as the signal "there should be a path here." Humans love beauty and detail in the environments they move through, but depending on the space, they can also value the efficiency of the shortest possible route slightly above aesthetics. A wise approach to landscaping holds these factors in a balance that reflects the efficiency-versus-aesthetics balance of the space's users. The space stays beautiful without any shaming required.

In our story, Sasha and Ingrid had exactly this kind of cue about where the developers were likely to walk across the grass. If that cue is taken as a signal that reflects efficiency, we can quickly see what can be done to avoid the antipattern when it starts to occur.

The signal was the observation that developers might simply choose the largest container all the time to avoid the fussy process of optimizing the compute resources being consumed. Some would consider that laziness and not a good signal to heed. However, most human laziness is deeply rooted in efficiency trade-offs. The developers intuitively understood that their time fussing with failure testing to optimize job containers, and their time diagnosing intermittent failures due to the varying content of those jobs, was not worth the amount of compute saved. That is especially true given the opportunity cost of not spending that time innovating on the core revenue-generating application.

Ingrid and Sasha's collaboration had initially missed the scaled human toil that was introduced to keep container resources at the minimum tolerable levels; they failed to factor the escalating cost of that toil into a comprehensive efficiency measurement. They were following a microservices resourcing pattern, which assumes the compute is purpose-designed around minimal and well-known workloads. Taken as a whole in a shared CI cluster, CI compute follows generalized compute patterns, where the needs for CPU, memory, disk IO, and network IO can vary wildly from one moment to the next.

In the broadest analysis, the infrastructure team over-indexed on the "team local" optimization of compute efficiency and unintentionally created a global de-optimization of scaled human toil for another team.

## How can this antipattern be avoided?

One way to combat over-indexing on a criterion is to have balancing objectives; this need is covered in "Measure What Matters" with the concept of counterbalancing objectives. There are some counterbalancing questions that can be asked of almost any automation effort. When solution architecture is functioning well, these questions are asked during the iterative process of building out a solution. Here are some applicable ones for this effort:

**Appropriate rules: Does the primary compute optimization heuristic match the characteristics of the actual compute workload being optimized?**

The main benefits of container compute for CI are dependency isolation, dependency encapsulation, and a clean build environment for every job. None of these benefits has to do with the extreme resource optimizations available when engineering microservices-architected applications. As a whole, CI compute reflects generalized compute, not the ultra-specialized compute of a 12-factor-architected microservice.

**Appropriate granularity: Does optimization need to be applied at every level?**

The fact that the cluster itself has elastic scaling at the Kubernetes node level is a higher-order optimization that will generate significant savings. Another possible optimization that would not require continuous fussing by developers is a node group running on spot compute (as long as the spot-backed runners self-identify as spot, so pipeline engineers can select appropriate jobs for it). These optimizations can create huge savings without creating scaled human toil.
**People and processes counter-check: Does the approach to optimization create scaled human toil, through its intensity, frequency, or lack of predictability, for any people anywhere in the organization?**

Automation is all about moving human toil into the world of machines. While optimizing machine resources must always be a primary consideration, it is a lower-priority objective than avoiding a net increase in human toil anywhere in your company. Machines can efficiently and elastically scale, while human workforces respond to scaling needs in months or even years.

### Avoid scaled human toil

Notice that neither the story nor the qualifying questions imply there is never a valid reason to have specialized runners that developers might need to select using tags. If a given attribute of runners could be selected once and with confidence, the antipattern would not be in play. One example would be selecting spot-compute-backed runners for workloads that can tolerate termination. It is the potential for repeated attention to calibrate container sizing, made worse by the possibility of intermittent failure based on job content, that pushes this specific scenario into the realm of "scaled human toil." The ability to leverage elastic cluster autoscaling is also a huge help in managing compute resources more efficiently.
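For example, a job that tolerates termination could select spot capacity once and keep a retry as a safety net. A minimal sketch, assuming the spot-backed runners advertise a hypothetical `spot` tag:

```yaml
# Hypothetical job: select spot-backed runners once, and retry if the
# underlying instance is reclaimed mid-job.
nightly-regression:
  stage: test
  tags:
    - spot                       # spot-backed runners self-identify via this tag
  retry:
    max: 2
    when: runner_system_failure  # re-run only when the runner itself fails
  script:
    - ./run-regression-suite.sh
```

Because the decision is made once and the retry absorbs reclaimed instances, this selection does not generate the recurring calibration work that container sizing did.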
If the risk of scaled human toil could be removed, some of this approach might be preserved. For example, one could have very large minimum pod resourcing plus a super-size, selected just once, for jobs that break the standard pod size. Caution is still warranted, because it is still possible that developers would have to fuss a lot to get a two-pod approach working in practice.

### Beware of scaled human toil of an individual

One thing the story did not highlight is that even if we were able to move all the fussing of such a design to the infrastructure engineer persona (perhaps by building an AI tuning mechanism that guesses at pod resourcing for a given job), the cumulative taxes on that role are frequently still not worth the expense. This is, in part, because it is a leveraged role: they help with all the automation of the scaled developer workforce, and any time they spend on one activity can't be spent on another. We humans are generally bad at accounting for opportunity costs: what else could that specific engineer be innovating on to make a stronger overall impact on the organization's productivity or bottom line? Given the very tight IT labor market, a given function may not be able to add headcount, so opportunity costs take on an outsized importance.

### Unlike people's time, cloud compute does not carry opportunity cost

A long time ago, people had to schedule time on shared computing resources. If that time was used for low-value compute activities, it could take time away from higher-value activities. In this model, compute time has an opportunity cost: the value of what the machine could be doing if it weren't doing a lower-value activity. Cloud compute has changed this, because when compute is not being used, it is not being paid for. Additionally, elastic scaling eliminates the costs of over-provisioning hardware and completely eliminates the administrative overhead of procuring capacity: if you need a lot for a short period of time, it is immediately available. In contrast, people's time is neither elastically scalable nor pay-per-use. This means the opportunity cost question, "What could this time be used for if it didn't have to be spent on low-value activities?", is still relevant for anything that creates activities for people.

### The first corollary to the Second Law of Complexity Dynamics

The Second Law of Complexity Dynamics was introduced in an earlier blog. The essence is that complexity is never destroyed, only reformed, and primarily it is moved across a boundary line that dictates whether the management of the complexity is in our domain or externalized. For instance, if you write a function for MD5 hashing in your code, you are managing the complexity of that code. If you install a dependency package that contains a premade MD5 hash function that you simply use, the complexity is externalized and managed for you by someone else.

In this story we introduce a corollary to that "Law": **Exchanging raw machine resources for complexity management is generally a reasonable trade-off.** In this case, our scaled human toil is created by the complexity of unending, daily management of compute-efficiency optimization. This does not mean that burning thousands of dollars of inefficient compute is OK because it saved someone 20 minutes of fussing. It is scoped in the following way:

- Scoped to "complexity management" (which is what creates the "scaled human toil" in our story): many minutes of toil that increase proportionally, or compound, with more of the activity.
- Scoped to "raw machine resources," meaning there are no additional logistics nor human toil to gain the resources. In the cloud, raw machine resources are generally available via configuration tweaks.
- Scoped to "generally reasonable," which indicates a disposition of being very cautious about increasing human toil with an automation solution, while still using models or calculations to check whether the rule actually holds in a given case.

So if we can externalize complexity management, that is great (the Second Law of Complexity Dynamics). If we can trade complexity management for raw computing resources, that is likely still better than managing it ourselves (the First Corollary).
### Iterating SA: Experimental improvements for your next project

This post contains specifics that can be used to avoid antipatterns when building out a Kubernetes cluster for GitLab CI. In the qualifying questions, however, we have attempted to go one meta-level higher, to help assess whether any automation effort has an "overly local" optimization focus that can inadvertently create a net loss of efficiency across the more global "company context." It is our opinion that automation efforts that create a net loss in human productivity should not be classified as automation at all. While that is strong medicine to apply to one's own work, doing so creates appropriate innovation pressure to ensure that individual automation efforts truly deliver on their inherent promise of higher human productivity and efficiency. So simply ask: "Does this way of solving a problem cause recurring work for anyone?"

### DevOps transformation and solution architecture perspectives

A technology architecture focus rightfully hones in on the technology choices for a solution build. But if it is the only lens, it can result in scenarios like our story. Solutions architecture steps back to a broader perspective to sanity-check that solution iterations account for a more complete picture of both the positive and negative impacts across all three of people, processes, and technology. As an organizational competency, DevOps emphasizes solution architecture perspectives when it is defined as a collaborative and cultural approach to people, processes, and technology.

Footnotes:

1. This fictional story was devised specifically for this article and does not knowingly reflect the details of any other published story or an actual situation. The names used in the story are from [GitLab's list of personas](https://handbook.gitlab.com/handbook/product/personas/).
2. Across a team of 300 full-time developers: 2% of an 8-hour workday is 9.6 minutes; 9.6 min/workday × 250 workdays/year = 2,400 minutes ≈ 5 workdays per developer per year. At $560 per day ($140K total compensation / 250 workdays), that is $2,800 per developer per year, or $840,000/year across 300 developers.

Cover image by [Kaleidico](https://unsplash.com/@kaleidico?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/)

# How adSoul transitioned to GitLab CI from Jenkins

*By Brein Matturro · November 5, 2019*

adSoul is a Germany-based, data-driven online marketing company that aims to improve search engine advertising and scalability for businesses. The core of adSoul relies heavily on API interfaces and entity recognition to post keywords on Google and Bing with marketing automation.

At GitLab Commit London, [Philipp Westphalen](https://www.linkedin.com/in/philipp-westphalen-a83318188/), fullstack developer at adSoul and GitLab Hero, shares how the company transitioned from Jenkins to GitLab CI.
adSoul is a startup with five developers, and as Philipp says, "We literally have no time for everything we need to do." They were looking for a tool that requires less time-consuming maintenance, and with Jenkins the team found it hard to read their existing files. "Our Jenkins was not so stable at all, and it was tough to change because it was managed by our provider," Philipp says. Cost and visibility were also huge motivators in moving from [Jenkins to GitLab CI](/blog/docker-my-precious/).

## GitLab migration in three phases

Phase 1: Move the repository.
The [adSoul team](https://www.adsoul.com) used GitLab's GitHub Import, but hit setbacks migrating their issues, so they created an open source GitHub issue migrator as a resolution. Following that, they adapted their scripts to the new origin by exchanging GitHub API calls for GitLab API calls. "This was really easy, and we had a stable build with our new repository, so we could move our product management to GitLab and not need GitHub anymore," Philipp says.

Phase 2: Migrate the CI/CD pipeline.
The team started to create a GitLab CI YAML and tried a simple "lift and shift"; however, their processes were more complicated than anticipated. Though this phase was time-consuming, it made clear that the team could move to phase three without hiccups. "Quick pro tip," says Philipp. "If you're running your own GitLab runners, increase the log limit if you have to debug your building step."

Phase 3: Improve the CI/CD pipeline.
The team rethought how they build their software and split projects into steps. "Our idea was that one job does one thing perfectly. Each job is simple and everyone can modify it easily," Philipp says. They improved their build time by moving to Gradle, creating parallel job processing, and using standard Docker images for ease of management.
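A minimal sketch of that "one job does one thing" shape, with independent jobs in the test stage running in parallel; the job names and Gradle tasks are illustrative, not adSoul's actual configuration:

```yaml
# Illustrative pipeline: small single-purpose jobs on standard Docker
# images; jobs within the test stage run in parallel.
stages:
  - build
  - test

build-jar:
  stage: build
  image: gradle:7-jdk17        # a standard image rather than a custom one
  script:
    - gradle assemble

unit-tests:
  stage: test
  image: gradle:7-jdk17
  script:
    - gradle test

static-analysis:
  stage: test                  # runs alongside unit-tests
  image: gradle:7-jdk17
  script:
    - gradle check
```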
## Takeaways from a successful migration

1. Plan your migration. Get every member of the team involved and aware of the upcoming changes, including how tools work together and what the expectations are going forward. "Take your time for the migration," Philipp says. "It's not two days and then we are finished."

2. Go step by step. adSoul used a three-phase plan, which allowed the team to deploy a new version while continuing to work on existing projects. "We could improve our application without having to wait for a better infrastructure," Philipp says.

3. Rethink your [DevOps strategy](/blog/better-devops-with-gitlab-ci-cd/). In the time leading up to the migration, examine things like security automation and the other important pieces of an overall DevOps strategy.

4. Start with a small project. Work closely with colleagues to create small GitLab CI projects that familiarize everyone before tackling larger, overwhelming ones.

Pro tip: Keep your pipeline user-friendly. Create a good user experience for the team with clear job names, style your config for a better overview, and write comments for variables and hard-to-understand code.

## Why GitLab works for a small team

"The most important thing is that GitLab is a powerful CI/CD solution with high customization," Philipp says. There is one home for all projects, without dependencies on one another. With Jenkins, even small exploratory changes can impact the larger job. "With GitLab, you don't have dependency between branches. So, if you're trying something new for your CI, you can do it simply in your branch and the master branch will not be affected by the changes," Philipp says.

The CI is low-maintenance, which is a useful timesaver for a smaller team. "The CI provides us with really low maintenance time. So, usually we don't have to care about our CI for a month or more," Philipp says.

To learn more about adSoul's migration to GitLab, watch Philipp's talk from GitLab Commit London.

<!-- blank line -->
<figure class="video_container">
  <iframe src="https://www.youtube.com/embed/C5xfw0ydh2k" frameborder="0" allowfullscreen="true"> </iframe>
</figure>
<!-- blank line -->

# Battling toolchain technical debt

*By Sandra Gittlen · June 21, 2022*

Developers love their tools. Operations teams love their tools. And security teams love their tools. As Dev, Sec, and Ops consolidate onto a single DevOps platform, toolchain technical debt becomes exponentially more costly and complex.

"Tools should be in the background enabling excellent development, operations, and security practices. However, DevOps teams are often led by their tools rather than the other way around, and that can hinder all aspects of the software development lifecycle (SDLC)," says [Cindy Blake](https://gitlab.com/cblake), CISSP, director of product and solutions marketing at GitLab.

An April 2022 Gartner® report titled "Beware the DevOps Toolchain Debt Collector" notes that "many organizations find themselves with outdated, poorly governed, and unmanageable toolchains as they scale DevOps initiatives."

One of the key findings, according to Gartner, is that "most organizations create homegrown toolchains, often leveraging the tools beyond their functional design.
This not only leads to a fragmented toolchain, but also creates complications when tooling needs to be scaled, replaced, or updated."

Toolchain technical debt introduces complexity as companies shift critical tasks such as reliability, governance, and compliance left in the SDLC.

> Discover how GitLab 15 can help your team deliver secure software while maintaining compliance and automating manual processes. Save the date for our GitLab 15 [launch event](https://page.gitlab.com/fifteen) on June 23rd!

## No time for technical debt

Few DevOps teams give toolchain upkeep the time and attention it requires. According to [GitLab's 2021 DevSecOps survey](/images/developer-survey/gitlab-devsecops-2021-survey-results.pdf), nearly two-thirds of respondents (61%) said they spend 20% or less of their time on toolchain integration and maintenance each month.

"Developers face challenges and time constraints while maintaining these complex, stand-alone tool siloes, building fragility and technical debt that the [infrastructure and operations] leader has to deal with," Gartner states. The research firm adds, "These outdated toolchains further increase overhead costs, magnify technical risks, add operational toil, and limit business agility."

Blake agrees: "Complex toolchains inhibit the ability to govern the software development and deployment process. Policies must be managed across tools, and visibility into code changes and changes to the surrounding infrastructure becomes difficult to see and track. Time is wasted on managing the toolchain instead of value-added work."

## Getting purpose-driven

The remedy to toolchain sprawl and the debt that follows is a change of strategy. Instead of putting energy into figuring out how to maintain one-off tools, DevOps teams should focus on enabling processes and policies that support simplicity, control, and visibility across the SDLC.

"These are the characteristics needed to meet reliability, governance, and compliance demands. A unified platform like GitLab helps you do that," Blake says.

Gartner states: "Successful infrastructure and operations leaders reduce technical debt and sustainably scale DevOps toolchain initiatives across the organization by using a prioritized, iterative strategy that minimizes friction in making changes to toolchains and more quickly delivers customer value."

Adopting a purpose-built platform instead of a complex, ad hoc toolchain also eases an organization's ability to automate the SDLC. "Automation abstracts complexity away from the developer and provides guardrails, so DevOps teams gain greater efficiency, accuracy, and consistency," Blake says. In addition, automation reduces the audit footprint in terms of what needs oversight and inspection.

Platforms also support automation throughout operations, including building and testing infrastructure as code, so that "you can eliminate the variables when you're trying to debug an application," she says. This speeds troubleshooting response times and reduces application downtime.

For instance, GitLab, the One DevOps Platform, features [dependency lists](https://docs.gitlab.com/ee/user/application_security/dependency_list/), also known as a software bill of materials (SBOM), that show which dependencies were used and help identify where problems exist. "GitLab also helps you avoid problems altogether by consistently scanning dependencies according to policies and compliance standards that the platform provides," Blake says.
DevOps teams can easily see what changes were made, when, and by whom. "That visibility is critical when trying to resolve issues and prevent them from happening again," she says.

## Reclaim your DevOps team's time

By adopting a single DevOps platform, organizations can reclaim developer, security, and operations time that has been spent stitching tools together, or optimizing for one developer's tool and then backtracking through toolchains when an application breaks because those tools can't coexist.

"DevOps teams have a lot on their plates, and trying to manage unruly toolchains is simply a waste of time. You should be creating state-of-the-art software, not manually integrating and maintaining legacy tools," Blake says.

She emphasizes that GitLab is not "rip and replace"; it's a platform where everything needed for DevOps comes together in one place. IT leadership benefits from this unified approach as well. [Value stream analytics](/solutions/value-stream-management/) provide insight into end-to-end software throughput, helping optimize IT resources and enabling flexible, responsive business outcomes. "We meet DevOps teams where they are and put the user, whether they be a developer, operations, or security professional, at the center of the platform," she says.

[Try GitLab Ultimate for free](/free-trial/) for 30 days.

_GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved._

# Building GitLab with GitLab: A multi-region service to deliver AI features

*By Chance Feick and Sam Wiskow · September 12, 2024*

For GitLab Duo, real-time AI-powered capabilities like [Code Suggestions](https://about.gitlab.com/solutions/code-suggestions/) need low-latency response times for a frictionless developer experience. Users don't want to interrupt their flow and wait for a code suggestion to show up. To ensure GitLab Duo can provide the right suggestion at the right time and meet high performance standards for critical AI infrastructure, GitLab recently launched our first multi-region service to deliver AI features.
Suggestions](https://about.gitlab.com/solutions/code-suggestions/) need low-latency response times for a frictionless developer experience. Users don’t want to interrupt their flow and wait for a code suggestion to show up. To ensure GitLab Duo can provide the right suggestion at the right time and meet high performance standards for critical AI infrastructure, GitLab recently launched our first multi-region service to deliver AI features.\n\nIn this article, we will cover the benefits of multi-region services, how we built an internal platform codenamed ‘Runway’ for provisioning and deploying multi-region services using GitLab features, and the lessons learned migrating to multi-region in production.\n\n## Background on the project\n\nRunway is GitLab’s internal platform as a service (PaaS) for provisioning, deploying, and operating containerized services. Runway's purpose is to enable GitLab service owners to self-serve infrastructure needs with production readiness out of the box, so application developers can focus on providing value to customers. As part of [our corporate value of dogfooding](https://handbook.gitlab.com/handbook/values/#results), the first iteration was built in 2023 by the Infrastructure department on top of core GitLab capabilities, such as continuous integration/continuous delivery ([CI/CD](https://about.gitlab.com/topics/ci-cd/)), environments, and deployments.\n\nBy establishing automated GitOps best practices, Runway services use infrastructure as code (IaC), merge requests (MRs), and CI/CD by default.\n\nGitLab Duo is primarily powered by [AI Gateway](https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist), a satellite service written in Python outside of GitLab’s modular monolith written in Ruby. In cloud computing, a region is a geographical location of data centers operated by cloud providers.\n\n## Defining a multi-region strategy\n\nDeploying in a single region is a good starting point for most services, but can come with downsides when you are trying to reach a global audience. Users who are geographically far from where your service is deployed may experience different levels of service and responsiveness than those who are closer. This can lead to a poor user experience, even if your service is well built in all other respects.\n\nFor AI Gateway, it was important to meet global customers wherever they are located, whether on GitLab.com or self-managed instances using Cloud Connector. When a developer is deciding to accept or reject a code suggestion, milliseconds matter and can define the user experience.\n\n### Goals\n\nMulti-region deployments require more infrastructure complexity, but for use cases where latency is a core component of the user experience, the benefits often outweigh the downsides. First, multi-region deployments offer increased responsiveness to the user. By serving requests from locations closest to end users, latency can be significantly reduced. Second, multi-region deployments provide greater availability. With fault tolerance, services can fail over during a regional outage. There is a much lower chance of a service failing completely, meaning users should not be interrupted even in partial failures.\n\nBased on our goals for performance and availability, we used this opportunity to create a scalable multi-region strategy in Runway, which is built leveraging GitLab features.\n\n### Architecture\n\nIn SaaS platforms, GitLab.com’s infrastructure is hosted on Google Cloud Platform (GCP). 
As a result, Runway’s first supported platform runtime is Cloud Run. The initial workloads deployed on Runway are stateless satellite services (e.g., AI Gateway), so Cloud Run services are a good fit that provides a clear migration path to more complex and flexible platform runtimes, e.g. Kubernetes.\n\nBuilding Runway on top of GCP Cloud Run using GitLab has allowed us to iterate and tease out the right level of abstractions for service owners as part of a platform play in the Infrastructure department.\n\nTo serve traffic from multiple regions in Cloud Run, the multi-region deployment strategy must support global load balancing, and the provisioning and configuration of regional resources. Here’s a simplified diagram of the proposed architecture in GCP:\n\n![simplified diagram of the proposed architecture in GCP](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098671/Blog/Content%20Images/Blog/Content%20Images/image7_aHR0cHM6_1750098671612.png)\n\nBy replicating Cloud Run services across multiple regions and configuring the existing global load balancing with serverless network endpoint group (NEG) backends, we’re able to serve traffic from multiple regions. For the remainder of the article, we’ll focus less on specifics of Cloud Run and more on how we’re building with GitLab.\n\n## Building a multi-region platform with GitLab\n\nNow that you have context about Runway, let's walk through how to build a multi-region platform using GitLab features.\n\n### Provision\n\nWhen building an internal platform, the first challenge is provisioning infrastructure for a service. In Runway, Provisioner is the component that is responsible for maintaining a service inventory and managing IaC for GCP resources using Terraform.\n\nTo provision a service, an application developer will open an MR to add a service project to the inventory using Git, and Provisioner will create required resources, such as service accounts and identity and access management policies. When building this functionality with GitLab, Runway leverages [OpenID Connect (OIDC) with GCP Workload Identity Federation](https://docs.gitlab.com/ee/ci/cloud\\_services/google\\_cloud/) for managing IaC.\n\nAdditionally, Provisioner will create a deployment project for each service project. The purpose of creating separate projects for deployments is to ensure the [principle of least privilege](https://about.gitlab.com/blog/the-ultimate-guide-to-least-privilege-access-with-gitlab/) by authenticating as a GCP service account with restricted permissions. Runway leverages the [Projects API](https://docs.gitlab.com/ee/api/projects.html) for creating projects with the [Terraform provider](https://registry.terraform.io/providers/gitlabhq/gitlab/latest/docs).
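\n\nAs a hypothetical aside (this is not Runway’s actual code; the file name and types are assumptions based on the inventory format shown below), the service inventory is simple enough that a provisioner-style component can unmarshal it in a few lines of Go before reconciling per-region resources:\n\n```go\npackage main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n)\n\n// Service mirrors one entry in the service inventory format shown below.\ntype Service struct {\n\tName      string   `json:\"name\"`\n\tProjectID int64    `json:\"project_id\"`\n\tRegions   []string `json:\"regions\"`\n}\n\ntype Inventory struct {\n\tInventory []Service `json:\"inventory\"`\n}\n\nfunc main() {\n\traw, err := os.ReadFile(\"inventory.json\") // hypothetical file name\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tvar inv Inventory\n\tif err := json.Unmarshal(raw, &inv); err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\t// A real provisioner would reconcile IaC per region at this point.\n\tfor _, svc := range inv.Inventory {\n\t\tfmt.Println(svc.Name, \"->\", len(svc.Regions), \"region(s)\")\n\t}\n}\n```\n\nFinally, Provisioner defines variables in the deployment project for the service account, so that deployment CI jobs can authenticate to GCP. 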
Runway leverages [CI/CD variables](https://docs.gitlab.com/ee/ci/variables/) and [Job Token allowlist](https://docs.gitlab.com/ee/ci/jobs/ci\\_job\\_token.html\\#add-a-group-or-project-to-the-job-token-allowlist) to handle authentication and authorization.\n\nHere’s a simplified example of provisioning a multi-region service in the service inventory:\n\n```json\n{\n  \"inventory\": [\n    {\n      \"name\": \"example-service\",\n      \"project_id\": 46267196,\n      \"regions\": [\n        \"europe-west1\",\n        \"us-east1\",\n        \"us-west1\"\n      ]\n    }\n  ]\n}\n```\n\nOnce provisioned, a deployment project and necessary infrastructure will be created for a service.\n\n### Configure\n\nAfter a service is provisioned, the next challenge is configuring the service. In Runway, [Reconciler](https://gitlab.com/gitlab-com/gl-infra/platform/runway/runwayctl) is a component that is responsible for configuring and deploying services by aligning the actual state with the desired state using Golang and Terraform.\n\nHere’s a simplified example of an application developer configuring GitLab CI/CD in their service project:\n\n```yaml\n# .gitlab-ci.yml\nstages:\n  - validate\n  - runway_staging\n  - runway_production\n\ninclude:\n  - project: 'gitlab-com/gl-infra/platform/runway/runwayctl'\n    file: 'ci-tasks/service-project/runway.yml'\n    inputs:\n      runway_service_id: example-service\n      image: \"$CI_REGISTRY_IMAGE/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}\"\n      runway_version: v3.22.0\n\n# omitted for brevity\n```\n\nRunway provides sane default values for configuration that are based on our experience in delivering stable and reliable features to customers. Additionally, service owners can configure infrastructure using a service manifest file hosted in a service project. The service manifest uses JSON Schema for validation. When building this functionality with GitLab, Runway leverages [Pages](https://docs.gitlab.com/ee/user/project/pages/) for schema documentation.\n\nTo deliver this part of the platform, Runway leverages [CI/CD templates](https://docs.gitlab.com/ee/development/cicd/templates.html), [Releases](https://docs.gitlab.com/ee/user/project/releases/), and [Container Registry](https://docs.gitlab.com/ee/user/packages/container\\_registry/) for integrating with service projects.\n\nHere’s a simplified example of a service manifest:\n\n```yaml\n# .runway/runway-production.yml\napiVersion: runway/v1\nkind: RunwayService\nspec:\n  container_port: 8181\n  regions:\n    - us-east1\n    - us-west1\n    - europe-west1\n\n# omitted for brevity\n```\n\nFor multi-region services, Runway injects an environment variable into the container instance runtime, e.g. `RUNWAY_REGION`, so application developers have the context to make any downstream dependencies regionally aware, e.g. the Vertex AI API.
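\n\nAs a sketch of how a service might consume that injected variable (illustrative, not Runway’s actual code; the fallback region is an assumption), a service can read `RUNWAY_REGION` at startup and derive a regional endpoint for a downstream dependency such as Vertex AI:\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"os\"\n)\n\n// vertexEndpoint derives a regional Vertex AI endpoint from the\n// RUNWAY_REGION variable that Runway injects at runtime.\nfunc vertexEndpoint() string {\n\tregion := os.Getenv(\"RUNWAY_REGION\")\n\tif region == \"\" {\n\t\tregion = \"us-east1\" // assumed fallback for local development\n\t}\n\treturn region + \"-aiplatform.googleapis.com:443\"\n}\n\nfunc main() {\n\tfmt.Println(vertexEndpoint())\n}\n```\n\nOnce configured, a service project will be integrated with a deployment project.\n\n### Deploy\n\nAfter a service project is configured, the next challenge is deploying a service. In Runway, Reconciler handles this by triggering a deployment job in the deployment project when an MR is merged to the main branch. 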
When building this functionality with GitLab, Runway leverages [Trigger Pipelines](https://docs.gitlab.com/ee/ci/triggers/) and [Multi-Project Pipelines](https://docs.gitlab.com/ee/ci/pipelines/downstream\\_pipelines.html\\#multi-project-pipelines) to trigger jobs from the service project to the deployment project.\n\n![trigger jobs from service project to deployment project](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image5_aHR0cHM6_1750098671612.png)\n\nOnce a pipeline is running in a deployment project, it will be deployed to an environment. By default, Runway will provision staging and production environments for all services. At this point, Reconciler will apply any Terraform resource changes for infrastructure. When building this functionality with GitLab, Runway leverages [Environments/Deployments](https://docs.gitlab.com/ee/ci/environments/) and [GitLab-managed Terraform state](https://docs.gitlab.com/ee/user/infrastructure/iac/terraform\\_state.html) for each service.\n\n![Reconciler applies any Terraform resource changes for infrastructure](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750098671614.png)\n\nRunway provides default application metrics for services. Additionally, custom metrics can be used by enabling a sidecar container with OpenTelemetry Collector configured to scrape Prometheus metrics and remote-write them to Mimir. By providing observability out of the box, Runway is able to bake monitoring into CI/CD pipelines.\n\nExample scenarios include gradual rollouts for blue/green deployments, preventing promotions to production when staging is broken, or automatically rolling back to the previous revision when elevated error rates occur in production.\n\n![Runway bakes monitoring into CI/CD pipelines](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image2_aHR0cHM6_1750098671615.png)\n\nOnce deployed, environments will serve the latest revision of a service. At this point, you should have a good understanding of some of the challenges that will be encountered, and how to solve them with GitLab features.\n\n## Migrating to multi-region in production\n\nAfter extending Runway components to support multi-region in Cloud Run, the final challenge was migrating from AI Gateway’s single-region deployment in production with zero downtime. Today, teams using Runway to deploy their services can self-serve regions, making a multi-region deployment just as simple as a single-region deployment.\n\nWe were able to iterate on building multi-region functionality without impacting existing infrastructure by using semantic versioning for Runway. Next, we’ll share some learnings from the migration that may inform how to operate services for an internal multi-region platform.\n\n### Dry run deployments\n\nIn Runway, Reconciler will apply Terraform changes in CI/CD. The trade-off is that plans cannot be verified in advance, which could risk inadvertently destroying or misconfiguring production infrastructure. To solve this problem, Runway will perform a “dry run” deployment for MRs.\n\n![\"Dry run\" deployment](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image6_aHR0cHM6_1750098671616.png)\n\nFor migrating AI Gateway, dry run deployments increased confidence and helped mitigate the risk of downtime during rollout. 
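\n\nFor illustration only (this is not Runway’s implementation), the core of a dry run can be as small as invoking `terraform plan` and inspecting its exit code; with `-detailed-exitcode`, Terraform returns 2 when changes are pending:\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"os/exec\"\n)\n\nfunc main() {\n\t// With -detailed-exitcode, terraform plan exits with:\n\t// 0 = no changes, 1 = error, 2 = changes pending.\n\tcmd := exec.Command(\"terraform\", \"plan\", \"-detailed-exitcode\", \"-no-color\")\n\tout, err := cmd.CombinedOutput()\n\tif cmd.ProcessState == nil {\n\t\tlog.Fatal(err) // the terraform binary could not be started\n\t}\n\n\tswitch cmd.ProcessState.ExitCode() {\n\tcase 0:\n\t\tfmt.Println(\"dry run: no infrastructure changes\")\n\tcase 2:\n\t\tfmt.Println(\"dry run: plan contains pending changes\")\n\t\tfmt.Print(string(out))\n\tdefault:\n\t\tlog.Fatal(\"dry run failed: \", string(out))\n\t}\n}\n```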
When building an internal platform with GitLab, we recommend supporting dry run deployments from the start.\n\n### Regional observability\n\nIn Runway, existing observability aggregated metrics under the assumption of a single-region deployment. To solve this problem, Runway observability was retrofitted to include a new region label for Prometheus metrics.\n\nOnce metrics were retrofitted, we were able to introduce service level indicators (SLIs) for both regional Cloud Run services and global load balancing. Here’s an example dashboard screenshot for a general Runway service:\n\n![dashboard screenshot for a general Runway service](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image3_aHR0cHM6_1750098671617.png)\n\n***Note:** Data is not actual production data and is only for illustration purposes.*\n\nAdditionally, we were able to update our service level objectives (SLOs) to support regions. As a result, service owners could be alerted when a specific region experiences an elevated error rate or an increase in response times.\n\n![screenshot of alerts](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098672/Blog/Content%20Images/Blog/Content%20Images/image4_aHR0cHM6_1750098671617.png)\n\n***Note:** Data is not actual production data and is only for illustration purposes.*\n\nFor migrating AI Gateway, regional observability increased confidence and helped provide more visibility into new infrastructure. When building an internal platform with GitLab, we recommend supporting regional observability from the start.
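\n\nAs a minimal sketch of the idea (assuming the common Prometheus Go client and a hypothetical metric name), a service can attach the injected region to its metrics so SLIs can be sliced per region:\n\n```go\npackage main\n\nimport (\n\t\"net/http\"\n\t\"os\"\n\n\t\"github.com/prometheus/client_golang/prometheus\"\n\t\"github.com/prometheus/client_golang/prometheus/promauto\"\n\t\"github.com/prometheus/client_golang/prometheus/promhttp\"\n)\n\n// requestsTotal is partitioned by serving region, so error rates and\n// response times can be computed per region as well as globally.\nvar requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{\n\tName: \"runway_service_requests_total\", // hypothetical metric name\n\tHelp: \"Requests served, labeled by region.\",\n}, []string{\"region\"})\n\nfunc main() {\n\tregion := os.Getenv(\"RUNWAY_REGION\")\n\n\thttp.Handle(\"/metrics\", promhttp.Handler())\n\thttp.HandleFunc(\"/\", func(w http.ResponseWriter, r *http.Request) {\n\t\trequestsTotal.WithLabelValues(region).Inc()\n\t\tw.WriteHeader(http.StatusOK)\n\t})\n\thttp.ListenAndServe(\":8181\", nil)\n}\n```\n\n### Self-service regions\n\nThe Infrastructure department successfully performed the initial migration of multi-region support for AI Gateway in production with zero downtime. Given the risk associated with rolling out a large infrastructure migration, it was important to ensure the service continued working as expected.\n\nShortly afterwards, service owners began self-serving additional regions to meet customer growth. At the time of writing, [GitLab Duo](https://about.gitlab.com/gitlab-duo/) is available in six regions around the globe and counting. Service owners are able to configure the desired regions, and Runway will provide guardrails along the way in a scalable solution.\n\nAdditionally, three other internal services have already started using multi-region functionality on Runway. Application developers have been able to self-serve this functionality entirely, which validates that we’ve provided a good platform experience for service owners. For a platform play, a scalable solution like Runway is a good outcome, since the Infrastructure department is no longer a blocker.\n\n## What’s next for Runway\n\nBased on how quickly we could iterate to provide results for customers, the SaaS Platforms department has continued to invest in Runway. We’ve grown the Runway team with additional contributors, started evolving the platform runtime (e.g. 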
Google Kubernetes Engine), and continue dogfooding with tighter integration in the product.\n\nIf you’re interested in learning more, feel free to check out [https://gitlab.com/gitlab-com/gl-infra/platform/runway](https://gitlab.com/gitlab-com/gl-infra/platform/runway).\n\n## More Building GitLab with GitLab\n- [Why there is no MLOps without DevSecOps](https://about.gitlab.com/blog/there-is-no-mlops-without-devsecops/)\n- [Stress-testing Product Analytics](https://about.gitlab.com/blog/building-gitlab-with-gitlab-stress-testing-product-analytics/)\n- [Web API Fuzz Testing](https://about.gitlab.com/blog/building-gitlab-with-gitlab-api-fuzzing-workflow/)\n- [How GitLab.com inspired Dedicated](https://about.gitlab.com/blog/building-gitlab-with-gitlabcom-how-gitlab-inspired-dedicated/)\n- [Expanding our security certification portfolio](https://about.gitlab.com/blog/building-gitlab-with-gitlab-expanding-our-security-certification-portfolio/)\n",[108,684,683,754,755,9,756,757,758,759],"inside GitLab","tutorial","google","git","DevSecOps","AI/ML",{"slug":761,"featured":90,"template":688},"building-gitlab-with-gitlab-a-multi-region-service-to-deliver-ai-features","content:en-us:blog:building-gitlab-with-gitlab-a-multi-region-service-to-deliver-ai-features.yml","Building Gitlab With Gitlab A Multi Region Service To Deliver Ai Features","en-us/blog/building-gitlab-with-gitlab-a-multi-region-service-to-deliver-ai-features.yml","en-us/blog/building-gitlab-with-gitlab-a-multi-region-service-to-deliver-ai-features",{"_path":767,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":768,"content":774,"config":782,"_id":784,"_type":13,"title":785,"_source":15,"_file":786,"_stem":787,"_extension":18},"/en-us/blog/building-gitlab-with-gitlab-stress-testing-product-analytics",{"title":769,"description":770,"ogTitle":769,"ogDescription":770,"noIndex":6,"ogImage":771,"ogUrl":772,"ogSiteName":672,"ogType":673,"canonicalUrls":772,"schema":773},"Building GitLab with GitLab: Stress-testing Product Analytics","We put Product Analytics through its paces internally to prep it for Beta. Find out what that entailed and how it led to feature improvements.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749659740/Blog/Hero%20Images/building-gitlab-with-gitlab-no-type.png","https://about.gitlab.com/blog/building-gitlab-with-gitlab-stress-testing-product-analytics","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Building GitLab with GitLab: Stress-testing Product Analytics\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"James Heimbuck\"},{\"@type\":\"Person\",\"name\":\"Sam Kerr\"}],\n        \"datePublished\": \"2023-12-14\",\n      }",{"title":769,"description":770,"authors":775,"heroImage":771,"date":778,"body":779,"category":730,"tags":780},[776,777],"James Heimbuck","Sam Kerr","2023-12-14","To understand how the features you develop and ship are helping you meet your goals, you need data. The previously announced [Product Analytics feature set](https://about.gitlab.com/blog/introducing-product-analytics-in-gitlab/) helps our customers do just that by providing tools to instrument code and process and visualize the data – all within GitLab.\n\n## Privacy first\n\nWe know customer privacy is a big concern for our customers and our customers' customers. 
As we said in our [announcement blog](https://about.gitlab.com/blog/introducing-product-analytics-in-gitlab/#our-continued-commitment-to-user-privacy):\n\n\u003Cp>\u003Ccenter>\"Product Analytics is designed to honor commonly recognized opt-out signals and we are designing Product Analytics to give you full control over the data being collected on a cluster managed by GitLab or your own.\"\u003C/center>\u003C/p>\n\nNothing about that approach has changed and it is too important not to mention again.\n\n## Customer Zero and the biggest customer\n\nWe are progressing quickly towards the open beta for Product Analytics. We are currently feature-complete for the beta with the managed product analytics stack, [five existing SDKs for instrumentation](https://docs.gitlab.com/ee/user/product_analytics/#instrument-a-gitlab-project), [default dashboards](https://docs.gitlab.com/ee/user/analytics/analytics_dashboards.html#product-analytics), and the recently released  improved Dashboard and Visualization Designer experiences. We are also learning more about what problems our internal users still have that they cannot solve with Product Analytics.\n\nAs we prepare for the Beta release of Product Analytics, it is important for us to know how the Managed Product Analytics stack will stand up to a bigger event load than we are getting from the initial customers and internal users. With our commitment to dogfooding, adding more internal projects was the obvious answer, so we worked with more internal teams to add instrumentation for the Metrics Dictionary and [GitLab Design System](https://design.gitlab.com/) sites.\n\nInstrumenting internal projects gave us additional feedback about the setup of Product Analytics and the usefulness of the Audience and Behavior Dashboards, showing how many users were visiting and what pages they visited. These gave us great insights into the usefulness of Product Analytics, but did not provide the volume of events we needed to really stress test Product Analytics at the scale we wanted. \n\n![product-analytics-default-dashboard-list](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749683252/Blog/Content%20Images/product-analytics-default-dashboard-list.png)\n\nAt the same time the Analytics Instrumentation team was hard at work developing an event framework to make instrumentation easier for GitLab developers. This lets the GitLab teams create new features and update existing ones faster to understand how changes impact our users. This also made it much easier and faster to add Product Analytics to GitLab.com, which provided the event volume that would stress test the Product Analytics stack so we could validate our assumptions.\n\nOnce fully enabled, with all page views and events going to the Managed Product Analytics stack, we saw a 17x increase in load above all other internally instrumented projects, receiving over 20 million events a day. That is a lot of events!\n\nBy instrumenting GitLab.com, we were able to see the stress cracks in our infrastructure _before_ introducing the features to users in our Beta. 
We were able to validate our scaling strategies, identify and resolve query performance concerns, improve the onboarding experience for our upcoming Beta program, and plan future improvements as we work towards [general availability](https://gitlab.com/groups/gitlab-org/-/epics/9902).\n\nWe have also proved to ourselves that Product Analytics can stand up to future customer load without making customers suffer through outages or slowness as we make the stack better.\n\n## What’s next for Product Analytics\n\nThroughout the internal release and the experiment phase, we have been talking to customers about what is and is not working with Product Analytics, especially the [built-in dashboards](https://docs.gitlab.com/ee/user/analytics/analytics_dashboards.html#product-analytics). From that feedback we have a number of improvements in mind that can't all fit here but check out our [Product Analytics direction page](https://about.gitlab.com/direction/monitor/product-analytics/#what-is-next-for-us-and-why) to see the latest on what improvements are coming next.\n\nTalking directly with users of Product Analytics is also informing the next iterations of other features like [Customizable Dashboards](https://gitlab.com/groups/gitlab-org/-/epics/8574) and [Visualization Designer](https://gitlab.com/groups/gitlab-org/-/epics/9386). The team is also exploring ways to [leverage AI](https://gitlab.com/groups/gitlab-org/-/epics/10335) to make it easier to find and understand Product Analytics data. \n\n## Share your feedback\n\nIt is an exciting time in product analytics and we cannot wait for you to try the feature out yourself! You can add ideas or comments to our [feedback issue](https://gitlab.com/gitlab-org/gitlab/-/issues/391970). We look forward to hearing from you!\n\n## Read more \"Building GitLab with GitLab\"\n\n- [Building GitLab with GitLab: How GitLab.com inspired Dedicated](https://about.gitlab.com/blog/building-gitlab-with-gitlabcom-how-gitlab-inspired-dedicated/)\n- [Building GitLab with GitLab: Web API Fuzz Testing](https://about.gitlab.com/blog/building-gitlab-with-gitlab-api-fuzzing-workflow/)\n",[758,781,9,754],"product",{"slug":783,"featured":90,"template":688},"building-gitlab-with-gitlab-stress-testing-product-analytics","content:en-us:blog:building-gitlab-with-gitlab-stress-testing-product-analytics.yml","Building Gitlab With Gitlab Stress Testing Product Analytics","en-us/blog/building-gitlab-with-gitlab-stress-testing-product-analytics.yml","en-us/blog/building-gitlab-with-gitlab-stress-testing-product-analytics",{"_path":789,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":790,"content":796,"config":802,"_id":804,"_type":13,"title":805,"_source":15,"_file":806,"_stem":807,"_extension":18},"/en-us/blog/compose-readers-and-writers-in-golang-applications",{"title":791,"description":792,"ogTitle":791,"ogDescription":792,"noIndex":6,"ogImage":793,"ogUrl":794,"ogSiteName":672,"ogType":673,"canonicalUrls":794,"schema":795},"Compose Readers and Writers in Golang applications","GitLab streams terabytes of Git data every hour using Golang abstractions of I/O implementations. 
Learn how to compose Readers and Writers in Golang apps.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099464/Blog/Hero%20Images/Blog/Hero%20Images/AdobeStock_639935439_3oqldo5Yt5wPonEJYZOLTM_1750099464124.jpg","https://about.gitlab.com/blog/compose-readers-and-writers-in-golang-applications","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Compose Readers and Writers in Golang applications\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Igor Drozdov\"}],\n        \"datePublished\": \"2024-02-15\",\n      }",{"title":791,"description":792,"authors":797,"heroImage":793,"date":799,"body":800,"category":681,"tags":801},[798],"Igor Drozdov","2024-02-15","Every hour, GitLab transfers terabytes of Git data between a server and a client. It is hard or even impossible to handle this amount of traffic unless it is done efficiently in a streaming fashion. Git data is served by Gitaly (Git server), GitLab Shell (Git via SSH), and Workhorse (Git via HTTP(S)). These services are implemented using Go - the language that conveniently provides abstractions to efficiently deal with I/O operations.\n\nGolang's [`io`](https://pkg.go.dev/io) package provides [`Reader`](https://pkg.go.dev/io#Reader) and [`Writer`](https://pkg.go.dev/io#Writer) interfaces to abstract the functionality of I/O implementations into public interfaces.\n\n`Reader` is the interface that wraps the basic `Read` method:\n\n```go\ntype Reader interface {\n\tRead(p []byte) (n int, err error)\n}\n```\n\n`Writer` is the interface that wraps the basic `Write` method.\n\n```go\ntype Writer interface {\n\tWrite(p []byte) (n int, err error)\n}\n```\n\nFor example, [`os`](https://pkg.go.dev/os) package provides an implementation of reading a file. `File` type implements `Reader` and `Writer` interfaces by defining basic [`Read`](https://pkg.go.dev/os#File.Read) and [`Write`](https://pkg.go.dev/os#File.Write) functions.\n\nIn this blog post, you'll learn how to compose Readers and Writers in Golang applications.\n\nFirst, let's read from a file and write its content to [`os.Stdout`](https://cs.opensource.google/go/go/+/master:src/os/file.go;l=66?q=Stdout&ss=go%2Fgo).\n\n```go\nfunc main() {\n\tfile, err := os.Open(\"data.txt\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tdefer file.Close()\n\n\tp := make([]byte, 32 * 1024)\n\tfor {\n\t\tn, err := file.Read(p)\n\n\t\t_, errW := os.Stdout.Write(p[:n])\n\t\tif errW != nil {\n\t\t\tlog.Fatal(errW)\n\t\t}\n\n\t\tif err != nil {\n\t\t\tif errors.Is(err, io.EOF) {\n\t\t\t\tbreak\n\t\t\t}\n\n\t\t\tlog.Fatal(err)\n\t\t}\n\t}\n}\n```\n\nEach call of the `Read` function fills the buffer `p` with the content from the file, i.e. the file is being consumed in chunks (up to `32KB`) instead of being fully loaded into the memory.\n\nTo simplify this widely used pattern, `io` package conveniently provides [`Copy`](https://pkg.go.dev/io#Copy) function that allows passing content from any `Reader` to any `Writer` and also [handles](https://cs.opensource.google/go/go/+/refs/tags/go1.21.0:src/io/io.go;l=433) additional edge cases.\n\n```go\nfunc main() {\n\tfile, err := os.Open(\"data.txt\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tdefer file.Close()\n\n\tif _, err := io.Copy(os.Stdout, file); err != nil {\n\t\tlog.Fatal(err)\n\t}\n}\n```\n\n`Reader` and `Writer` interfaces are used across the whole Golang ecosystem because they facilitate reading and writing content in a streaming fashion. 
Therefore, gluing together the Readers and Writers with the functions that expect these interfaces as arguments is a frequent problem to solve. Sometimes it's as straightforward as passing content from a Reader into a Writer, but sometimes the content written into a Writer must be represented as a Reader or the content from a reader must be sent into multiple Writers. Let's have a closer look into different use cases and the examples of solving these types of problems in the `GitLab` codebase.\n\n## Reader -> Writer\n\n**Problem**\n\nWe need to pass content from a Reader into a Writer.\n\n![readers and writers - image 1](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750099494917.png)\n\n**Solution**\n\nThe problem can be solved by using [`io.Copy`](https://pkg.go.dev/io#Copy).\n\n```go\nfunc Copy(dst Writer, src Reader) (written int64, err error)\n```\n\n**Example**\n\n[`InfoRefs*`](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/gitaly/smarthttp.go#L18-35) Gitaly RPCs return a `Reader` and we want to [stream](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/git/info-refs.go#L78-80) its content to a user via HTTP response:\n\n```go\nfunc handleGetInfoRefsWithGitaly(ctx context.Context, responseWriter *HttpResponseWriter, a *api.Response, rpc, gitProtocol, encoding string) error {\n        ...\n        infoRefsResponseReader, err := smarthttp.InfoRefsResponseReader(ctx, &a.Repository, rpc, gitConfigOptions(a), gitProtocol)\n        ...\n        if _, err = io.Copy(w, infoRefsResponseReader); err != nil {\n            return err\n        }\n        ...\n}\n```\n\n## Reader -> Multiple Writers\n\n**Problem**\n\nWe need to pass content from a Reader into multiple Writers.\n\n![readers and writers - image 3](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image3_aHR0cHM6_1750099494917.png)\n\n**Solution**\n\nThe `io` package provides [`io.MultiWriter`](https://pkg.go.dev/io#MultiWriter) function that _converts_ multiple Writers into a single one. When its `Write` function is called, the content is copied to all the Writers ([implementation](https://cs.opensource.google/go/go/+/refs/tags/go1.21.0:src/io/multi.go;l=127)).\n\n```go\nfunc MultiWriter(writers ...Writer) Writer\n```\n\n**Example**\n\nGiven we want to [build](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/upload/destination/multi_hash.go#L13-18) `md5`, `sha1`, `sha256` and `sha512` hashes from the same content. [`Hash`](https://pkg.go.dev/hash#Hash) type is a `Writer`. Using `io.MultiWriter`, we define [`multiHash`](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/upload/destination/multi_hash.go#L43-61) Writer. 
After the content is [written](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/upload/destination/destination.go#L124-125) to the `multiHash`, we [calculate](https://gitlab.com/gitlab-org/gitlab/blob/57aafb6a886d05c15dd0fa372fb4f008bec014ea/workhorse/internal/upload/destination/multi_hash.go#L63-70) the hashes of all these functions in a single run.\n\nThe simplified version of the example is:\n\n```go\npackage main\n\nimport (\n\t\"crypto/sha1\"\n\t\"crypto/sha256\"\n\t\"fmt\"\n\t\"io\"\n\t\"log\"\n)\n\nfunc main() {\n\ts1 := sha1.New()\n\ts256 := sha256.New()\n\n\tw := io.MultiWriter(s1, s256)\n\tif _, err := w.Write([]byte(\"content\")); err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tfmt.Println(s1.Sum(nil))\n\tfmt.Println(s256.Sum(nil))\n}\n```\n\nFor simplicity, we just call `Write` function on a Writer, but when content comes from a Reader, then `io.Copy` can be used as well:\n\n```go\n_, err := io.Copy(io.MultiWriter(s1, s256), reader)\n```\n\n## Multiple Readers -> Reader\n\n**Problem**\n\nWe have multiple Readers and need to sequentially read from them.\n\n![readers and writers - image 4](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image4_aHR0cHM6_1750099494919.png)\n\n**Solution**\n\nThe `io` package provides [`io.MultiReader`](https://pkg.go.dev/io#MultiReader) function that _converts_ multiple Readers into a single one. The Readers are read in the passed order.\n\n```go\nfunc MultiReader(readers ...Reader) Reader\n```\n\nThen this Reader can be used in any function that accepts `Reader` as an argument.\n\n**Example**\n\nWorkhorse [reads](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/cmd/gitlab-resize-image/png/reader.go#L26-38) the first `N` bytes of an image to detect whether it's a PNG file and _puts them back_ by building a Reader from multiple Readers:\n\n```go\nfunc NewReader(r io.Reader) (io.Reader, error) {\n\tmagicBytes, err := readMagic(r)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tif string(magicBytes) != pngMagic {\n\t\tdebug(\"Not a PNG - read file unchanged\")\n\t\treturn io.MultiReader(bytes.NewReader(magicBytes), r), nil\n\t}\n\n\treturn io.MultiReader(bytes.NewReader(magicBytes), &Reader{underlying: r}), nil\n}\n```\n\n## Multiple Readers -> Multiple Writers\n\n**Problem**\n\nWe need to pass content from multiple Readers into multiple Writers.\n\n![readers and writers - image 6](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image6_aHR0cHM6_1750099494921.png)\n\n**Solution**\n\nThe solutions above can be generalized on the many-to-many use case.\n\n```go\n_, err := io.Copy(io.MultiWriter(w1, w2, w3), io.MultiReader(r1, r2, r3))\n```\n\n## Reader -> Reader + Writer\n\n**Problem**\n\nWe need to read content from a Reader or pass the Reader to a function and simultaneously write the content into a Writer.\n\n![readers and writers - image 2](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image2_aHR0cHM6_1750099494923.png)\n\n**Solution**\n\nThe `io` package provides [io.TeeReader](https://pkg.go.dev/io#TeeReader) function that accepts a Reader to read from, a Writer to write to, and returns a Reader that can be processed further.\n\n```go\nfunc TeeReader(r Reader, w Writer) Reader\n```\n\nThe 
[implementation](https://cs.opensource.google/go/go/+/refs/tags/go1.21.4:src/io/io.go;l=610) of the functionality is straightforward. The passed `Reader` and `Writer` are stored in a structure that is a `Reader` itself:\n\n```go\nfunc TeeReader(r Reader, w Writer) Reader {\n\treturn &teeReader{r, w}\n}\n\ntype teeReader struct {\n\tr Reader\n\tw Writer\n}\n```\n\nThe `Read` function implemented for the structure delegates the `Read` to the passed `Reader` and also performs a `Write` to the passed `Writer`:\n\n```go\nfunc (t *teeReader) Read(p []byte) (n int, err error) {\n\tn, err = t.r.Read(p)\n\tif n > 0 {\n\t\tif n, err := t.w.Write(p[:n]); err != nil {\n\t\t\treturn n, err\n\t\t}\n\t}\n\treturn\n}\n```\n\n**Example 1**\n\nWe already touched on the hashing topic in the `Reader -> Multiple Writers` section, and `io.TeeReader` is [used](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/upload/destination/destination.go#L124-125) to provide a Writer to create a hash from content. The returned Reader can be further used to upload content to object storage.\n\n**Example 2**\n\nWorkhorse uses `io.TeeReader` to [implement](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/dependencyproxy/dependencyproxy.go#L57-101) Dependency Proxy [functionality](https://docs.gitlab.com/ee/user/packages/dependency_proxy/). Dependency Proxy caches requested upstream images in object storage. The not-yet-cached use case has the following behavior:\n\n- A user performs an HTTP request.\n- The upstream image is fetched using [`net/http`](https://pkg.go.dev/net/http) and [`http.Response`](https://pkg.go.dev/net/http#Response) provides its content via the `Body` field, which is an [`io.ReadCloser`](https://pkg.go.dev/io#ReadCloser) (basically an `io.Reader`).\n- We need to send this content back to the user by writing it into [`http.ResponseWriter`](https://pkg.go.dev/net/http#ResponseWriter) (basically an `io.Writer`).\n- We need to simultaneously upload the content to object storage by performing an [`http.Request`](https://pkg.go.dev/net/http#NewRequest) (a function that accepts an `io.Reader`).\n\nAs a result, `io.TeeReader` can be used to glue these primitives together:\n\n```go\nfunc (p *Injector) Inject(w http.ResponseWriter, r *http.Request, sendData string) {\n\t// Fetch upstream data via HTTP\n\tdependencyResponse, err := p.fetchUrl(r.Context(), sendData)\n\t...\n\t// Create a tee reader. Each Read will read from dependencyResponse.Body and simultaneously\n\t// perform a Write to the w writer\n\tteeReader := io.TeeReader(dependencyResponse.Body, w)\n\t// Pass the tee reader as the body of an HTTP request to upload it to object storage\n\tsaveFileRequest, err := http.NewRequestWithContext(r.Context(), \"POST\", r.URL.String()+\"/upload\", teeReader)\n\t...\n\tnrw := &nullResponseWriter{header: make(http.Header)}\n\tp.uploadHandler.ServeHTTP(nrw, saveFileRequest)\n\t...\n```
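\n\nTo see the mechanics in isolation, here is a small self-contained sketch of ours (not from the Workhorse codebase) that hashes content while streaming it:\n\n```go\npackage main\n\nimport (\n\t\"crypto/sha256\"\n\t\"fmt\"\n\t\"io\"\n\t\"strings\"\n)\n\nfunc main() {\n\th := sha256.New()\n\t// Every Read from tee also writes the bytes just read into h.\n\ttee := io.TeeReader(strings.NewReader(\"content\"), h)\n\n\t// Stream the reader to any destination; io.Discard stands in for\n\t// an upload or an HTTP response.\n\tif _, err := io.Copy(io.Discard, tee); err != nil {\n\t\tfmt.Println(err)\n\t\treturn\n\t}\n\n\tfmt.Printf(\"%x\", h.Sum(nil))\n}\n```\n\n## Writer -> Reader\n\n**Problem**\n\nWe have a function that accepts a Writer, and we are interested in the content that the function would write into the Writer. 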
We want to intercept the content and represent it as a Reader to further process it in a streaming fashion.\n\n![readers and writers - image 5](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099495/Blog/Content%20Images/Blog/Content%20Images/image5_aHR0cHM6_1750099494924.png)\n\n**Solution**\n\nThe `io` package provides the [`io.Pipe`](https://pkg.go.dev/io#Pipe) function that returns a Reader and a Writer:\n\n```go\nfunc Pipe() (*PipeReader, *PipeWriter)\n```\n\nThe Writer can be passed to any function that accepts a Writer. All the content that has been written into it will be accessible via the reader, i.e. a synchronous in-memory pipe is created that can be used to connect code expecting an `io.Reader` with code expecting an `io.Writer`.
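\n\nHere is a toy example of ours showing the mechanics end to end:\n\n```go\npackage main\n\nimport (\n\t\"fmt\"\n\t\"io\"\n\t\"strings\"\n)\n\nfunc main() {\n\tpr, pw := io.Pipe()\n\n\t// Writes to a PipeWriter block until a reader consumes them,\n\t// so the producing side runs in its own goroutine.\n\tgo func() {\n\t\tdefer pw.Close() // Close signals EOF to the reader.\n\t\tio.Copy(pw, strings.NewReader(\"streamed through a pipe\"))\n\t}()\n\n\t// pr can be handed to any API that expects an io.Reader.\n\tcontent, err := io.ReadAll(pr)\n\tif err != nil {\n\t\tfmt.Println(err)\n\t\treturn\n\t}\n\tfmt.Println(string(content))\n}\n```\n\n**Example 1**\n\nFor [LSIF](https://lsif.dev/) file [transformation](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/lsif_transformer/parser/parser.go#L68-72) for code navigation we need to:\n\n- [Read](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/lsif_transformer/parser/parser.go#L48-51) content of a zip file.\n- Transform the content and [serialize](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/lsif_transformer/parser/docs.go#L97-112) it into [`zip.Writer`](https://pkg.go.dev/archive/zip#Writer).\n- [Represent](https://gitlab.com/gitlab-org/gitlab/blob/d97ce3baab7fbf459728ce18766fefd3abb8892f/workhorse/internal/lsif_transformer/parser/parser.go#L68-72) the new compressed content as a Reader to be further processed in a streaming fashion.\n\nThe [`zip.NewWriter`](https://pkg.go.dev/archive/zip#NewWriter) function accepts a Writer to which it will write the compressed content. It is handy when we need to pass an open file descriptor to the function to save the content to the file. 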
However, when we need to pass the compressed content via an HTTP request, we need to represent the data as a Reader.\n\n```go\n// io.Pipe() creates a reader and a writer.\npr, pw := io.Pipe()\n\n// The writer is passed to the `parser.transform` function, which will write\n// the transformed, compressed content into it.\n// The writing should happen asynchronously in a goroutine because each `Write` to\n// the `PipeWriter` blocks until it has satisfied one or more `Read`s from the `PipeReader`.\ngo parser.transform(pw)\n\n// Everything that has been written into it is now accessible via the reader.\nparser := &Parser{\n\tDocs: docs,\n\tpr:   pr,\n}\n\n// pr is a reader that can be used to read all the data written to the pw writer\nreturn parser, nil\n```\n\n**Example 2**\n\nFor Geo setups, [GitLab Shell](https://gitlab.com/gitlab-org/gitlab-shell) proxies all `git push` operations to the secondary and redirects them to the primary.\n\n- GitLab Shell establishes an SSH connection and defines a [`ReadWriter`](https://gitlab.com/gitlab-org/gitlab-shell/blob/7898d8e69daf51a7b6e01052c4516ca70893a2d4/internal/command/readwriter/readwriter.go#L6-7) struct that has an `In` field of type `io.Reader` to read data from the user and an `Out` field of type `io.Writer` to send responses to the user.\n- GitLab Shell performs an HTTP request to `/info/refs` and sends `response.Body` of type `io.Reader` to the user using [`io.Copy`](https://gitlab.com/gitlab-org/gitlab-shell/blob/7898d8e69daf51a7b6e01052c4516ca70893a2d4/internal/command/githttp/push.go#L60).\n- The user reacts to this response by sending data to `In`, and GitLab Shell needs to read this data, convert it to a request expected by Git HTTP, and send it as an HTTP request to `/git-receive-pack`. This is where `io.Pipe` becomes useful.\n\n```go\nfunc (c *PushCommand) requestReceivePack(ctx context.Context, client *git.Client) error {\n\t// Define pipeReader and pipeWriter and use pipeWriter to collect all the data\n\t// sent by the user, converted to a format expected by Git HTTP.\n\tpipeReader, pipeWriter := io.Pipe()\n\t// The writing happens asynchronously because it's a blocking operation\n\tgo c.readFromStdin(pipeWriter)\n\n\t// pipeReader can be passed as io.Reader and used to read all the data written to pipeWriter\n\tresponse, err := client.ReceivePack(ctx, pipeReader)\n\t...\n\t_, err = io.Copy(c.ReadWriter.Out, response.Body)\n\t...\n}\n\nfunc (c *PushCommand) readFromStdin(pw *io.PipeWriter) {\n\tvar needsPackData bool\n\n\t// Scanner reads the user input line by line\n\tscanner := pktline.NewScanner(c.ReadWriter.In)\n\tfor scanner.Scan() {\n\t\tline := scanner.Bytes()\n\t\t// And writes it to the pipe writer\n\t\tpw.Write(line)\n\t\t...\n\t}\n\n\t// The data that hasn't been processed by a scanner is copied if necessary\n\tif needsPackData {\n\t\tio.Copy(pw, c.ReadWriter.In)\n\t}\n\n\t// Close the pipe writer to signify EOF for the pipe reader\n\tpw.Close()\n}\n```\n\n## Try Golang\n\nGolang provides elegant patterns designed to efficiently process data in a streaming fashion. 
The patterns can be used to address new challenges or to refactor existing code that has performance issues associated with high memory consumption.\n\n> Learn more about [GitLab and Golang](https://docs.gitlab.com/ee/development/go_guide/).\n",[755,757,754,9],{"slug":803,"featured":6,"template":688},"compose-readers-and-writers-in-golang-applications","content:en-us:blog:compose-readers-and-writers-in-golang-applications.yml","Compose Readers And Writers In Golang Applications","en-us/blog/compose-readers-and-writers-in-golang-applications.yml","en-us/blog/compose-readers-and-writers-in-golang-applications",{"_path":809,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":810,"content":816,"config":824,"_id":826,"_type":13,"title":827,"_source":15,"_file":828,"_stem":829,"_extension":18},"/en-us/blog/developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai",{"title":811,"description":812,"ogTitle":811,"ogDescription":812,"noIndex":6,"ogImage":813,"ogUrl":814,"ogSiteName":672,"ogType":673,"canonicalUrls":814,"schema":815},"Developing GitLab Duo: AI Impact analytics dashboard measures the ROI of AI","Our blog series continues spotlighting a new feature that provides detailed metrics, such as the Code Suggestions Usage Rate, to help understand the effectiveness of AI investments.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098611/Blog/Hero%20Images/Blog/Hero%20Images/blog-hero-banner-1-0178-820x470-fy25_7JlF3WlEkswGQbcTe8DOTB_1750098611370.png","https://about.gitlab.com/blog/developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Developing GitLab Duo: AI Impact analytics dashboard measures the ROI of AI\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2024-05-15\",\n      }",{"title":811,"description":812,"authors":817,"heroImage":813,"date":819,"body":820,"category":821,"tags":822},[818],"Haim Snir","2024-05-15","***Generative AI marks a monumental shift in the software development industry, making it easier to develop, secure, and operate software. Our new blog series, written by our product and engineering teams, gives you an inside look at how we create, test, and deploy the AI features you need integrated throughout the enterprise. Get to know new capabilities within GitLab Duo and how they will help DevSecOps teams deliver better results for customers.***\n\nAs organizations adopt [GitLab Duo](https://about.gitlab.com/gitlab-duo/), our suite of AI features to power DevSecOps workflows, business and engineering leaders need real-time visibility into the technology's ROI. Granular usage data, performance improvements, the trade-off between speed, security, and quality, and other [productivity metrics](https://about.gitlab.com/blog/measuring-ai-effectiveness-beyond-developer-productivity-metrics/) are essential to evaluate the effectiveness of AI in software development. 
That's why we created the AI Impact analytics dashboard for GitLab Duo, available in GitLab 17.0, as a new way to measure the ROI of AI.\n\n> [Take an interactive tour of the AI Impact analytics dashboard](https://gitlab.navattic.com/ai-impact).\n\n## Understanding the ROI of GitLab Duo AI-powered capabilities\n\nTo properly evaluate AI's impact on the software development lifecycle, organizations have told us they want to:\n- visualize which metrics improved as a result of investments in AI\n- compare the performance of teams that are using AI against teams that are not using AI\n- track the progress of AI adoption\n- automate insights extraction from a large volume of performance data\n\nThe AI Impact analytics dashboard offers these capabilities and more through customizable visualizations, enabling teams to:\n- **Monitor AI adoption:** Observing AI adoption rates enables organizations to evaluate organizational strategies to maximize the ROI on their technology investments. \n- **Track performance improvements:** By tracking performance metrics and observing changes after the adoption of AI, leaders can quickly assess the benefits and business value of AI features.\n\n## What is the AI Impact analytics dashboard?\n\nIn this first release of the AI Impact analytics dashboard, we focus on providing insights and metrics about GitLab Duo Code Suggestions adoption, including:\n\n- **Detailed usage metrics:** Discover the ratio of monthly Code Suggestions usage compared to the total number of unique code contributors to know how deeply Code Suggestions is adopted within your teams.\n- **Correlation observations:** Examine how trends in AI usage within a project or across a group influence other crucial productivity metrics, displayed for the current month and the trailing six months. \n    - For this correlation analysis, we added a new metric, \"Code Suggestions Usage Rate,\" as the Independent Variable (the cause). The monthly Code Suggestions Usage Rate is calculated as the number of monthly unique Code Suggestions users divided by the total number of monthly unique [contributors](https://docs.gitlab.com/ee/user/profile/contributions_calendar.html#user-contribution-events). GitLab considers the total monthly unique code contributors, which means only users with push events are included in the calculation.\n    - As Dependent Variables (the effect), we added these [performance metrics](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html#dashboard-metrics-and-drill-down-reports): Cycle Time, Lead Time, and Deployment Frequency. And as [Quality and Security Metrics](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html#dashboard-metrics-and-drill-down-reports), we added Change Failure Rate and Critical Vulnerabilities. \n- **Comparison view:** Understand the difference in the performance of teams that are and are not using AI, and manage the trade-off between speed, quality, and security exposure.\n\n![Comparison of AI usage and SDLC performance](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098621/Blog/Content%20Images/Blog/Content%20Images/image4_aHR0cHM6_1750098620998.png)
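\n\nAs a worked example with made-up numbers, the Code Suggestions Usage Rate calculation reduces to a simple ratio:\n\n```go\npackage main\n\nimport \"fmt\"\n\nfunc main() {\n\t// Hypothetical month: 45 unique Code Suggestions users out of\n\t// 120 unique code contributors (users with push events).\n\tcodeSuggestionsUsers := 45.0\n\tcodeContributors := 120.0\n\n\tusageRate := codeSuggestionsUsers / codeContributors * 100\n\tfmt.Println(usageRate) // 37.5 (percent)\n}\n```\n\n## What’s next for the AI Impact analytics dashboard?\n\nLooking ahead, we have exciting plans to expand the capabilities of the AI Impact analytics dashboard. Here are some of the highlights:\n\n1. 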
New tile visualizations such as \"GitLab Duo Seats: Assigned and Used,\" \"Code Suggestions: Acceptance Rate %,\" and \"GitLab Duo Chat: Unique Users\" to gain a deeper insight into usage patterns for GitLab Duo.\n\n![AI Impact - image 2](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098621/Blog/Content%20Images/Blog/Content%20Images/Screenshot_2024-07-17_at_12.50.31_aHR0cHM6_1750098620999.png)\n\n2. New comparison bar chart to help users observe how changes in one metric correlate with changes in others:\n\n![AI Impact comparison bar chart](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098621/Blog/Content%20Images/Blog/Content%20Images/image3_aHR0cHM6_1750098621000.png)\n\n3. AI statistics in the [Contribution analytics report](https://docs.gitlab.com/ee/user/group/contribution_analytics/index.html) to understand how users interact with AI features. See which users are leveraging AI features and whether their performance has changed over time:\n\n![Contribution analytics report](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098621/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750098621001.png)\n\n## Get started today\n\nWe're excited about the potential of the AI Impact analytics dashboard to not only demonstrate the real-world business outcomes of AI but also to drive more informed decisions regarding future AI optimizations for the DevSecOps lifecycle. For more information about what is coming next and to share feedback or questions, [please visit our AI Impact analytics dashboard epic](https://gitlab.com/groups/gitlab-org/-/epics/12978).\n\nStart your [free trial of GitLab Duo and the AI Impact analytics dashboard today](https://about.gitlab.com/gitlab-duo/#free-trial).\n\n## Read more of the \"Developing GitLab Duo\" series\n\n- [Developing GitLab Duo: How we validate and test AI models at scale](https://about.gitlab.com/blog/developing-gitlab-duo-how-we-validate-and-test-ai-models-at-scale/)\n- [Developing GitLab Duo: How we are dogfooding our AI features](https://about.gitlab.com/blog/developing-gitlab-duo-how-we-are-dogfooding-our-ai-features/)\n- [Developing GitLab Duo: Secure and thoroughly test AI-generated code](https://about.gitlab.com/blog/how-gitlab-duo-helps-secure-and-thoroughly-test-ai-generated-code/)\n- [Developing GitLab Duo: Blending AI and Root Cause Analysis to fix CI/CD pipelines](https://about.gitlab.com/blog/developing-gitlab-duo-blending-ai-and-root-cause-analysis-to-fix-ci-cd/)\n\n_Disclaimer: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. 
The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab._","ai-ml",[759,9,823],"features",{"slug":825,"featured":90,"template":688},"developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai","content:en-us:blog:developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai.yml","Developing Gitlab Duo Ai Impact Analytics Dashboard Measures The Roi Of Ai","en-us/blog/developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai.yml","en-us/blog/developing-gitlab-duo-ai-impact-analytics-dashboard-measures-the-roi-of-ai",{"_path":831,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":832,"content":838,"config":844,"_id":846,"_type":13,"title":847,"_source":15,"_file":848,"_stem":849,"_extension":18},"/en-us/blog/getting-started-with-value-streams-dashboard",{"title":833,"description":834,"ogTitle":833,"ogDescription":834,"noIndex":6,"ogImage":835,"ogUrl":836,"ogSiteName":672,"ogType":673,"canonicalUrls":836,"schema":837},"Getting started with the new GitLab Value Streams Dashboard","Benchmark your value stream lifecycle, DORA, and vulnerabilities metrics to gain valuable insights and uncover patterns for continuous improvements.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749671793/Blog/Hero%20Images/16_0-cover-image.png","https://about.gitlab.com/blog/getting-started-with-value-streams-dashboard","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Getting started with the new GitLab Value Streams Dashboard\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2023-06-12\",\n      }",{"title":833,"description":834,"authors":839,"heroImage":835,"date":840,"body":841,"category":681,"tags":842},[818],"2023-06-12","\n\n\u003Ci>This is part two of our multipart series introducing you to the capabilities within GitLab Value Stream Management and the Value Streams Dashboard. In part one, [learn about the Total Time Chart](https://about.gitlab.com/blog/value-stream-total-time-chart/) and how to simplify top-down optimization flow with Value Stream Management.\u003C/i>\n\nGetting started with GitLab [Value Streams Dashboard](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html), a customizable dashboard that enables decision-makers to identify trends, patterns, and opportunities for digital transformation improvements, is easy. If you're already using GitLab Value Stream Management, simply navigate to your project's or group's Analytics tab, and within [Value stream analytics](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#view-value-stream-analytics), click on the \"Value Streams Dashboard - DORA\" link. This will open a new page with the Value Streams Dashboard.\n\n![image of DORA Metrics console](https://about.gitlab.com/images/blogimages/vsdCover.png){: .shadow}\nDORA metrics comparison panel\n{: .note.text-center}\n\nGitLab Value Stream Management allows customers to visualize their end-to-end DevSecOps workstreams, manage their software development processes, and gain insight into how digital transformation and technological investments are delivering value and driving business results. 
GitLab Value Stream Management is able to do this because GitLab provides an entire DevOps platform as a single application and, therefore, holds all the data needed to provide end-to-end visibility throughout the entire software development lifecycle. So now your decisions rely on actual data rather than blind estimation or gut feelings. Additionally, because GitLab is the place where work happens, GitLab Value Stream Management insights are also actionable, allowing your users to move from \"understanding\" to \"fixing\" at any time, from within their workflow and without losing context.\n\nThe centralized UI in Value Streams Dashboard acts as the single source of truth (SSOT), where all stakeholders can access and view the same set of metrics that are relevant to the organization. The SSOT views ensure consistency, eliminate discrepancies, and provide a reliable and unified source of data for decision-making and analysis.\n\nThe first iteration of the GitLab Value Streams Dashboard was focused on enabling teams to continuously improve software delivery workflows by benchmarking [value stream lifecycle metrics, DORA metrics, and vulnerabilities metrics](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html#dashboard-metrics-and-drill-down-reports). One of the key features is a new DevSecOps metrics comparison panel that displays the metrics for a group or project in the month-to-date, last month, the month before, and the past 180 days.\n\nThis comparison enables managers to track team improvements in the context of the other DevSecOps metrics to find patterns or trends over time. The data is presented in a clear and concise manner, ensuring that you can quickly grasp the significance of the metrics.\n\n![The Value Streams Dashboard helps you get a high-level custom view over multiple DevOps metrics and understand whether they are improving month-over-month](https://about.gitlab.com/images/blogimages/2023-05-18_vsd_1.gif){: .shadow}\nValue Streams Dashboard metrics comparison panel\n{: .note.text-center}\n\nAdditionally, from each metric you can drill down to a detailed report to investigate the underlying data, understand what is affecting the team performance, and identify actionable insights.\n\nWe understand that every organization has its own set of subgroups and projects, each with specific processes and terminology. That's why we designed our dashboard to be flexible and adaptable. Users have the power to [customize](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html#customize-the-dashboard-panels) their dashboard by including panels from different subgroups or projects. \n\nTracking and comparing these metrics over a period of time helps teams catch downward trends early, drill down into individual projects/metrics, take remedial actions to maintain their software delivery performance, and track progress of their innovation investments. Value Streams Dashboard's intuitive interface reduces the learning curve and eliminates the need for extensive training. Everyone can now immediately leverage the platform's unified data store power, maximizing their productivity and saving precious time and resources.\n\n## Value Streams Dashboard roadmap\nWe are just getting started with delivering new capabilities in our Value Streams Dashboard. 
The roadmap includes planned features and functionality that will continue to improve decision-making and operational efficiencies.\n\nSome of the capabilities we plan to focus on next include:\n\n- adding an [executive-level summary](https://gitlab.com/groups/gitlab-org/-/epics/9558) of key metrics related to software performance and flow of value across the organization\n- adding a [\"DORA Performers score\"](https://gitlab.com/groups/gitlab-org/-/epics/10416) panel with the DORA metrics health from all the organization's groups and projects\n- adding [filter by label to the comparison panel](https://gitlab.com/gitlab-org/gitlab/-/issues/388890) - we recognize that not every team follows the same flow, so we are adding the ability to slice and dice the dashboard views with GitLab labels as filters\n\nTo help us improve the Value Stream Management Dashboard, please share feedback about your experience in this [survey](https://gitlab.fra1.qualtrics.com/jfe/form/SV_50guMGNU2HhLeT4).\n\n## Learn more\n* Find out what's next on the [Value Stream Management direction page](https://about.gitlab.com/direction/plan/value_stream_management/#whats-next-and-why).\n\n* Learn how to use the new dashboard with the [Value Streams Dashboard documentation](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html).\n\n* Watch this short video on Value Streams Dashboards:\n\n\u003Ciframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/EA9Sbks27g4\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen>\u003C/iframe>\n\nCheck out part three of this multipart series: \"[GitLab's 3 steps to optimizing software value streams](https://about.gitlab.com/blog/three-steps-to-optimize-software-value-streams/)\".\n\n\u003Ci>Disclaimer: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. 
The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab.\u003C/i>\n",[755,823,9,732,843],"agile",{"slug":845,"featured":6,"template":688},"getting-started-with-value-streams-dashboard","content:en-us:blog:getting-started-with-value-streams-dashboard.yml","Getting Started With Value Streams Dashboard","en-us/blog/getting-started-with-value-streams-dashboard.yml","en-us/blog/getting-started-with-value-streams-dashboard",{"_path":851,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":852,"content":858,"config":865,"_id":867,"_type":13,"title":868,"_source":15,"_file":869,"_stem":870,"_extension":18},"/en-us/blog/git-fetch-performance-2021-part-2",{"title":853,"description":854,"ogTitle":853,"ogDescription":854,"noIndex":6,"ogImage":855,"ogUrl":856,"ogSiteName":672,"ogType":673,"canonicalUrls":856,"schema":857},"Git fetch performance improvements in 2021, Part 2 ","Looking back at the server-side performance improvements we made in 2021 for Git fetch.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749663383/Blog/Hero%20Images/tanuki-bg-full.png","https://about.gitlab.com/blog/git-fetch-performance-2021-part-2","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Git fetch performance improvements in 2021, Part 2 \",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Jacob Vosmaer\"}],\n        \"datePublished\": \"2022-02-07\",\n      }",{"title":853,"description":854,"authors":859,"heroImage":855,"date":861,"body":862,"category":681,"tags":863},[860],"Jacob Vosmaer","2022-02-07","\nIn [Part 1](/blog/git-fetch-performance/) of this two-part series, we looked at how much server-side Git fetch performance, especially for CI, has improved in GitLab in 2021. Now, we will discuss how we achieved this.\n\n## Recap of Part 1\n-   In December 2019, we set up custom CI fetch caching automation for\n   `gitlab-org/gitlab`, which we internally called \"the CI pre-clone\n   script\".\n-   In December 2020, we encountered some production incidents on GitLab.com,\n   which highlighted that the CI pre-clone script had become critical\n   infrastructure but, at the same time, it had not yet matured beyond\n   a custom one-off solution.\n-   Over the course of 2021, we built an alternative caching solution\n   for CI Git fetch traffic called the pack-objects cache. In Part 1,\n   we discussed a benchmark simulating CI fetch traffic which shows\n   that the pack-objects cache combined with other efficiency\n   improvements reduced GitLab server CPU consumption 9x compared to\n   the baseline of December 2020.\n\n## The pack-objects cache\n\nAs discussed in Part 1, what we realized through the\nproduction incidents in December 2020 was that the CI pre-clone script\nfor `gitlab-org/gitlab` had become a critical piece of infrastructure.\nAt the same time, it benefited only one Git repository on GitLab.com,\nand it was not very robust. It would be much better to have an\nintegrated solution that benefits all repositories. We achieved this\ngoal by building the [pack-objects cache](https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#pack-objects-cache).\n\nThe name \"pack-objects cache\" refers to `git pack-objects`, which is\nthe Git [subcommand](https://git-scm.com/docs/git-pack-objects) that\nimplements the [packfile](https://git-scm.com/book/en/v2/Git-Internals-Packfiles) compression algorithm. 
As this [Git commit message from Jeff King](https://gitlab.com/gitlab-org/gitlab-git/-/commit/20b20a22f8f7c1420e259c97ef790cb93091f475) explains, `git pack-objects` is a good candidate for a CI fetch cache.\n\n> You may want to insert a caching layer around\n> pack-objects; it is the most CPU- and memory-intensive\n> part of serving a fetch, and its output is a pure\n> function of its input, making it an ideal place to\n> consolidate identical requests.\n\nThe pack-objects cache is GitLab's take on this \"caching layer\". It\ndeduplicates identical Git fetch requests that arrive within a short\ntime window.\n\nAt a high level, when serving a fetch, we buffer the output of `git\npack-objects` into a temporary file. If an identical request comes in,\nwe serve it from the buffer file instead of creating a new `git\npack-objects` process. After 5 minutes, we delete the buffer file. If\nyou want to know more about how exactly the cache is implemented, you\ncan look at the implementation\n([1](https://gitlab.com/gitlab-org/gitaly/-/blob/v14.6.3/internal/gitaly/service/hook/pack_objects.go),\n[2](https://gitlab.com/gitlab-org/gitaly/-/tree/v14.6.3/internal/streamcache)).\n\n![Architecture diagram](https://about.gitlab.com/images/blogimages/git-fetch-2021/pack-objects-cache-architecture.jpg)\n\nBecause the amount of space used by the cache files is bounded roughly\nby the eviction window (5 minutes) multiplied by the maximum network bandwidth\nof the Gitaly server, we don't have to worry about the cache using a\nlot of storage. In fact, on GitLab.com, we store the cache files on the\nsame disks that hold the repository data. We leave a safety margin of\nfree space on these disks at all times anyway, and the cache fits in\nthat safety margin comfortably.\n\nSimilarly, we also don't notice the increased disk input/output\noperations per second (IOPS) used by the cache on GitLab.com. There\nare two reasons for this. First of all, whenever we _read_ data from\nthe cache, it is usually still in the Linux page cache, so it gets\nserved from RAM. The cache barely does any disk read I/O operations.\nSecond, although the cache does do _write_ operations, these fit\ncomfortably within the maximum sustained IOPS rate supported by the\nGoogle Compute Engine persistent disks we use.\n\nThis leads us to a disadvantage of the pack-objects cache, which is\nthat it really does write a lot of data to disk. On GitLab.com, we saw\nthe disk write throughput jump up by an order of magnitude. You can\nsee this in the graph below, which shows disk writes for a single\nGitaly server with a busy, large repository on it (the GitLab [company\nwebsite](https://gitlab.com/gitlab-com/www-gitlab-com)). You can\nclearly see the number of bytes written to disk per second jump up when we\nturned the cache on.\n\n![increased disk writes with cache enabled](https://about.gitlab.com/images/blogimages/git-fetch-2021/cache-disk-writes.jpg)\n\nThis increase in disk writes is not a problem for our infrastructure because we have the\nspare capacity, but we were not sure we could assume the same for all\nother GitLab installations in the world. Because of this, we decided\nto leave the pack-objects cache off by default.\n\nThis was a difficult decision because we think almost all GitLab\ninstallations would benefit from having this cache enabled. 
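\n\nBefore moving on, it is worth making the caching idea concrete. Here is a minimal Ruby sketch of the dedupe-and-evict shape described above; it is illustrative only, since Gitaly's real implementation is the Go code linked above and its keying, streaming, and concurrency details differ.\n\n```ruby\nrequire \"digest\"\nrequire \"tempfile\"\n\n# Illustrative sketch: identical requests within the eviction window\n# share one `git pack-objects` invocation and one buffer file.\nclass PackObjectsCache\n  EVICTION_WINDOW = 5 * 60 # seconds, the 5-minute window described above\n\n  Entry = Struct.new(:file, :created_at)\n\n  def initialize\n    @entries = {}\n    @mutex = Mutex.new\n  end\n\n  # request_bytes is the raw input we would feed to `git pack-objects`;\n  # its digest is what makes two requests \"identical\".\n  def fetch(request_bytes)\n    key = Digest::SHA256.hexdigest(request_bytes)\n    entry = @mutex.synchronize do\n      evict_stale\n      @entries[key] ||= Entry.new(generate(request_bytes), Time.now)\n    end\n    File.binread(entry.file.path) # repeats are served from the buffer file\n  end\n\n  private\n\n  # First request for this key: run pack-objects once, buffer its output.\n  def generate(request_bytes)\n    file = Tempfile.new(\"pack-objects-cache\")\n    IO.popen(%w[git pack-objects --revs --stdout -q], \"r+\") do |git|\n      git.write(request_bytes)\n      git.close_write\n      IO.copy_stream(git, file)\n    end\n    file.tap(&:rewind)\n  end\n\n  def evict_stale\n    @entries.delete_if do |_key, entry|\n      Time.now - entry.created_at > EVICTION_WINDOW && (entry.file.unlink || true)\n    end\n  end\nend\n```\n\n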
One of the\nreasons we are writing this blog post is to raise awareness that this\nfeature is available, so that self-managed GitLab administrators can\nopt in to using it.\n\nAgain, on the positive side, the cache did not introduce a new\npoint of failure on GitLab.com. If the `gitaly` service is running,\nand if the repository storage disk is available, then the cache is\navailable. There are no external dependencies. And if `gitaly` is not\nrunning, or the repository storage disk is unavailable, then the whole\nGitaly server is unavailable anyway.\n\nAnd finally, cache capacity grows naturally with the number of Gitaly\nservers. Because the cache is completely local to each Gitaly server,\nwe do not have to worry about whether the cache keeps working as we\ngrow GitLab.com.\n\nThe pack-objects cache was introduced in GitLab 13.11. In GitLab 14.5,\nwe made it a lot more efficient by optimizing its transport using Unix\nsockets\n([1](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3758),\n[2](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3759)). If\nyou want to [try out the pack-objects cache](https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#pack-objects-cache) on\nyour self-managed GitLab instance, we recommend that you upgrade to\nGitLab 14.5 or newer first.\n\n## Improved RPC transport for Git HTTP\n\nAfter we built the pack-objects cache, we were able to generate a much\nhigher volume of Git fetch responses on a single Gitaly server.\nHowever, we then found out that the RPC transport between the HTTP\nfront-end (GitLab Workhorse) and the Gitaly server became a\nbottleneck. We tried disabling the CI pre-clone script of\n`gitlab-org/gitlab` in April 2021 but we quickly had to turn it back\non because the increased volume of Git fetch data transfer was slowing\ndown the rest of Gitaly.\n\nThe fetch traffic was acting as a noisy neighbor to all the other\ntraffic on `gitlab-org/gitlab`. For each GitLab.com Gitaly server, we\nhave a request latency\n[SLI](https://sre.google/sre-book/service-level-objectives/). This is\na metric that observes request latencies for a selection of RPCs that\nwe expect to be fast, and it tracks how many requests for these RPCs\nare \"fast enough\". If the percentage of fast-enough requests drops\nbelow a certain threshold, we know we have a problem.\n\nWhen we disabled the pre-clone script, the network traffic to the\nGitaly server hosting `gitlab-org/gitlab` went up, as expected. What\nwent wrong was that the percentage of fast-enough requests started to\ndrop. This was not because the server had to serve up more data: The\nRPCs that serve the Git fetch data do not count towards the latency\nSLI.\n\nBelow you see two graphs from the day we tried disabling the CI\npre-clone script. First, see how the network traffic off of the Gitaly\nserver increased once we disabled the CI pre-clone script. This is\nbecause instead of pulling most of the data from object storage, and\nonly some of the data from Gitaly, the CI runners now started pulling\nall of the Git data they needed from Gitaly.\n\n![network peaks](https://about.gitlab.com/images/blogimages/git-fetch-2021/no-script-network-annotated.png)\n\nNow consider our Gitaly request latency SLI for this particular\nserver. For historical reasons, we call this \"Apdex\" in our dashboards.\nRecall that this SLI tracks the percentage of fast-enough requests from\na selection of Gitaly RPCs. The ideal number would be 100%. 
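\n\nAs a rough illustration, computing such an SLI over a window of observed request latencies is simple; the RPC names and thresholds below are invented for the example and are not our production settings.\n\n```ruby\n# Illustrative latency SLI: the share of requests, among a selected set\n# of RPCs, that completed under a per-RPC \"fast enough\" threshold.\n# Names and thresholds are invented for the example.\nFAST_ENOUGH_SECONDS = {\n  \"FindCommit\" => 0.05,\n  \"GetBlob\" => 0.10,\n  \"RefExists\" => 0.05,\n}.freeze\n\n# samples is an array of { rpc: String, duration: Float } observations\n# taken during the measurement window.\ndef latency_sli(samples)\n  relevant = samples.select { |s| FAST_ENOUGH_SECONDS.key?(s[:rpc]) }\n  return 1.0 if relevant.empty?\n\n  fast = relevant.count { |s| FAST_ENOUGH_SECONDS[s[:rpc]] >= s[:duration] }\n  fast.to_f / relevant.size\nend\n\nsamples = [\n  { rpc: \"FindCommit\", duration: 0.02 },\n  { rpc: \"FindCommit\", duration: 0.30 }, # too slow: drags the SLI down\n  { rpc: \"GetBlob\", duration: 0.08 },\n]\nputs format(\"SLI: %.1f%%\", latency_sli(samples) * 100) # => SLI: 66.7%\n```\n\n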
In the\ntime window where the CI pre-clone script was disabled, this graph\nspent more time below 99%, and it even dipped below 96% several times.\n\n![latency drops](https://about.gitlab.com/images/blogimages/git-fetch-2021/no-script-latency-annotated.png)\n\nEven though we could not explain what was going on, the latency SLI dips\nwere clear evidence that disabling the CI pre-clone script slowed down\nunrelated requests to this Gitaly server to an unacceptable degree.\nThis was a setback for our plan to replace the CI pre-clone script.\n\nBecause we did not want to just give up, we set aside some time to try\nto understand what the bottleneck was, and if it could be\ncircumvented. The bad news is that we did not come up with a\nsatisfactory answer about what the bottleneck was. But the good news is\nthat we were able to circumvent it.\n\nBy building a simplified [prototype alternate RPC\ntransport](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1046),\nwe were able to find out that with the pack-objects cache, the\nhardware we run on and Git itself were able to serve up much more\ntraffic than we were able to get out of GitLab. We [never got to the\nbottom](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1024)\nof what was causing all the overhead, but a likely suspect is the fact\nthat gRPC-Go allocates memory for each message it sends, and with Git\nfetch traffic we send a lot of messages. Gitaly was spending a lot of\ntime doing garbage collection.\n\nWe then had to decide how to improve the situation. Because we were\nuncertain if we could fix the apparent bottleneck in gRPC, and because\nwe were certain that we could go faster by not sending the Git fetch data\nthrough gRPC in the first place, we chose to do the latter. We created\nmodified versions of the RPCs that carry the bulk of the Git fetch\ndata. On the surface, the new versions are still gRPC methods. But\nduring a call, each will establish a side channel, and use that for\nthe bulk data transfer.\n\n![side channel diagram](https://about.gitlab.com/images/blogimages/git-fetch-2021/sidechannel.png)\n\nThis way we avoided making major changes to the structure of Gitaly:\nit is still a gRPC server application. Logging, metrics,\nauthentication, and other middleware work as normal on the optimized\nRPCs. But most of the data transfer happens on either Unix sockets (for localhost RPC calls) or [Yamux streams](https://github.com/hashicorp/yamux/) (for the regular RPC calls).
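\n\nTo give a feel for the side channel idea, here is a toy Ruby sketch. The names are invented, and the real Gitaly code (in Go) negotiates the channel inside the gRPC call rather than over a well-known path.\n\n```ruby\nrequire \"socket\"\nrequire \"stringio\"\nrequire \"tmpdir\"\n\n# Toy sketch: the RPC only negotiates where the bytes will flow; the\n# bulk transfer is a raw byte stream on a Unix socket, outside the RPC\n# message framing, so there is no per-message allocation.\npack_data = \"PACK...\" * 10_000 # stand-in for `git pack-objects` output\npath = File.join(Dir.mktmpdir, \"sidechannel.sock\")\nserver = UNIXServer.new(path)\n\n# Server side: hand each connection the pack stream as one buffered copy.\nserver_thread = Thread.new do\n  conn = server.accept\n  IO.copy_stream(StringIO.new(pack_data), conn)\n  conn.close\nend\n\n# Client side: in the real system the rendezvous information travels in\n# the RPC itself; here the client simply knows the path.\nreceived = UNIXSocket.open(path) { |sock| sock.read }\nserver_thread.join\nputs received.bytesize == pack_data.bytesize # => true\n```\n\n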
Because we have 6x more Git HTTP traffic than Git SSH traffic on\nGitLab.com, we decided to initially only optimize the transport for\nGit HTTP traffic. We are still working on [doing the same for Git\nSSH](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/652) because, even though Git HTTP efficiency is more important for\nGitLab.com than that of Git SSH, we know that for some self-managed\nGitLab instances it is the other way around.\n\nThe new server-side RPC transport for Git HTTP was released in GitLab\n14.5. There is no configuration required for this improved transport.\nRegardless of whether you use the pack-objects cache on your GitLab\ninstance, Gitaly, Workhorse, and Praefect all use less CPU to handle\nGit HTTP fetch requests now.\n\nThe payoff for this work came in October 2021 when we disabled the CI\npre-clone script for `gitlab-org/gitlab`, which did not cause any\nnoisy neighbor problems this time. We have had no issues since then\nserving the Git fetch traffic for that project.\n\n## Improvements to Git itself\n\nAside from the pack-objects cache and the new RPC transport between\nWorkhorse and Gitaly, we also saw some improvements because of changes\nin Git itself. We discovered a few inefficiencies which we\nreported to the Git mailing list and helped get fixed.\n\nOur main repository `gitlab-org/gitlab` has hundreds of thousands of [Git\nreferences](https://git-scm.com/book/en/v2/Git-Internals-Git-References). Looking at CPU profiles, we [noticed](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/400) that a lot of Git\nfetch time was spent on the server iterating over these references.\nThese references were not even being sent back to the client; Git was\njust scanning through all of them on the server twice for each CI job.\n\nIn both cases, the problem could be fixed by doing a scan over a\nsubset instead of a scan across all references. These two problems got fixed\n([1](https://gitlab.com/gitlab-org/gitlab-git/-/commit/b3970c702cb0acc0551d88a5f34ad4ad2e2a6d39), [2](https://gitlab.com/gitlab-org/gitlab-git/-/commit/be18153b975844f8792b03e337f1a4c86fe87531)) in Git 2.31.0, released in March 2021.\n\nLater on, we found a different problem, also in the reference-related\nworkload of Git fetch. As part of the fetch protocol, the server sends\na list of references to the client so that the client can update its\nlocal branches etc. It turned out that for each reference, Git was\ndoing 1 or 2 `write` system calls on the server. This led to [a lot of\noverhead](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1257), and this was made worse by our old RPC transport which could\nend up sending 1 RPC message per advertised Git reference.\n\nThis problem got fixed in Git itself by changing the functions that\nwrite the references to [use buffered\nIO](https://gitlab.com/gitlab-org/gitlab-git/-/commit/70afef5cdf29b5159f18df1b93722055f78740f8).\nThis change landed in Git 2.34.0, released in November 2021. Ahead of\nthat, it got shipped in GitLab 14.4 as a custom Git patch.\n\nFinally, we discovered that increasing the copy buffer size used by\n`git upload-pack` to relay `git pack-objects` output made both `git\nupload-pack` and [every link in the chain after\nit](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/4224) more\nefficient. This got fixed in Git by [increasing the buffer\nsize](https://gitlab.com/gitlab-org/gitlab-git/-/commit/55a9651d26a6b88c68445e7d6c9f511d1207cbd8).\nThis change is part of Git 2.35.0 and is included in GitLab 14.7, both\nof which were released in January 2022.
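\n\nBoth of the last two fixes are variations on the same classic pattern: do fewer, larger writes. As a rough Ruby illustration (Git's actual fix is in C), writing one ref per unbuffered write issues one system call per reference, while batching through a buffer issues only a few dozen:\n\n```ruby\nrequire \"tempfile\"\n\n# Rough illustration of the buffered-IO fix; Git's real change is in C.\n# Each element stands in for one advertised ref in the response.\nrefs = Array.new(100_000) { |i| \"0000000000000000000000000000000000000000 refs/heads/branch-#{i} \" }\n\n# Unbuffered: one write(2) system call per reference.\nTempfile.create(\"unbuffered\") do |out|\n  refs.each { |line| out.syswrite(line) }\nend\n\n# Buffered: flush in 64KB chunks, a few dozen system calls in total.\nTempfile.create(\"buffered\") do |out|\n  buffer = String.new\n  refs.each do |line|\n    buffer.concat(line)\n    if buffer.bytesize >= 64 * 1024\n      out.syswrite(buffer)\n      buffer.clear\n    end\n  end\n  out.syswrite(buffer) unless buffer.empty?\nend\n```\n\n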
## Summary\n\nIn Part 1, we showed that GitLab server performance when serving CI Git fetch traffic has improved a lot in 2021. In this post, we explained that the improvements are due to:\n\n- The pack-objects cache\n- A more efficient Git data transport between server-side GitLab components\n- Efficiency improvements in Git itself\n\n## Thanks\n\nMany people have contributed to the work described in this blog post.\nI would like to specifically thank Quang-Minh Nguyen and Sean McGivern\nfrom the Scalability team, and Patrick Steinhardt and Sami Hiltunen\nfrom the Gitaly team.\n\n## Related content\n\n- Improvements to the client-side performance of `git fetch` (although GitLab is a server application, it sometimes acts as a Git client): [mirror fetches](https://gitlab.com/gitlab-org/git/-/issues/95), [fetches into repositories with many references](https://gitlab.com/gitlab-org/git/-/issues/94)\n- Improvements to server-side Git push performance: [consistency check improvements](https://gitlab.com/gitlab-org/git/-/issues/92)\n",[757,864,9],"production",{"slug":866,"featured":6,"template":688},"git-fetch-performance-2021-part-2","content:en-us:blog:git-fetch-performance-2021-part-2.yml","Git Fetch Performance 2021 Part 2","en-us/blog/git-fetch-performance-2021-part-2.yml","en-us/blog/git-fetch-performance-2021-part-2",{"_path":872,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":873,"content":879,"config":884,"_id":886,"_type":13,"title":887,"_source":15,"_file":888,"_stem":889,"_extension":18},"/en-us/blog/git-fetch-performance",{"title":874,"description":875,"ogTitle":874,"ogDescription":875,"noIndex":6,"ogImage":876,"ogUrl":877,"ogSiteName":672,"ogType":673,"canonicalUrls":877,"schema":878},"How we made Git fetch performance improvements in 2021, part 1","Our Scalability team tackled a server CPU utilization issue. Here's the first part of a detailed look at performance improvements we made for Git fetch.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749663397/Blog/Hero%20Images/logoforblogpost.jpg","https://about.gitlab.com/blog/git-fetch-performance","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we made Git fetch performance improvements in 2021, part 1\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Jacob Vosmaer\"}],\n        \"datePublished\": \"2022-01-20\",\n      }",{"title":874,"description":875,"authors":880,"heroImage":876,"date":881,"body":882,"category":681,"tags":883},[860],"2022-01-20","\nIn this post we look back on a series of projects from the Scalability\nteam that improved GitLab server-side efficiency for serving Git fetch\ntraffic. In the benchmark described below we saw a 9x reduction in\nGitLab server CPU utilization. Most of the performance gain comes from the\nGitaly pack-objects cache, which has proven very effective at reducing\nthe Gitaly server load caused by highly concurrent CI pipelines.\n\nThese changes are not user-visible but they benefit the stability and\navailability of GitLab.com. If you manage a GitLab instance\nyourself you may want to [enable the pack-objects\ncache](https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#pack-objects-cache)\non your instance too.\n\nWe discuss how we achieved these improvements in [part 2](/blog/git-fetch-performance-2021-part-2/).\n\n## Background\n\nWithin the GitLab application, Gitaly is the component that acts as a\nremote procedure call (RPC) server for Git repositories. 
On\nGitLab.com, repositories are stored on persistent disks attached to\ndedicated Gitaly servers, and the rest of the application accesses\nrepositories by making RPC calls to Gitaly.\n\nIn 2020 we encountered several incidents on GitLab.com caused by the fact that\nour Gitaly server infrastructure [could not\nhandle](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3013)\nthe Git fetch traffic generated by CI on our own main repository,\n[`gitlab-org/gitlab`](https://gitlab.com/gitlab-org/gitlab). The only reason the situation at the time worked\nwas that we had a custom CI caching solution for\n`gitlab-org/gitlab` only, commonly referred to as the \"CI pre-clone\nscript\".\n\n### The CI pre-clone script\n\nThe CI pre-clone script was an implementation of the [clone bundle CI\nfetching\nstrategy](https://www.kernel.org/best-way-to-do-linux-clones-for-your-ci.html).\nWe had originally set up the CI pre-clone script one year earlier, in\n[December 2019](https://gitlab.com/gitlab-org/gitlab/-/issues/39134).\nIt consisted of two parts.\n\n1.   A CI cron job that would clone `gitlab-org/gitlab`, pack up the\n   result into a tarball, and upload it to a known Google Cloud\n   Storage bucket.\n1.   A shell script snippet, stored in the `gitlab-org/gitlab` project settings, that was\n   injected into each `gitlab-org/gitlab` CI job. This shell script\n   would download and extract the latest tarball from the known URL.\n   After that the CI job did an incremental Git fetch, relative to the\n   tarball contents, to retrieve the actual CI pipeline commit.\n\nThis system was very effective. Our CI pipelines run against shallow\nGit clones of `gitlab-org/gitlab`, which require over 100MB of data to\nbe transferred per CI job. Because of the CI pre-clone script, the\namount of Git data per job was closer to 1MB. The rest of the data was\nalready there because of the tarball. The amount of repository data\ndownloaded by each CI job stayed the same, but only 1% of this data\nhad to come from a Gitaly server. This saved a lot of computation and\nbandwidth on the Gitaly server hosting `gitlab-org/gitlab`.\n\nAlthough this solution worked well, it had a number of downsides.\n\n1.   It was not part of the application and required per-project manual\n   set-up and maintenance.\n1.   It did not work for forks of `gitlab-org/gitlab`.\n1.   It had to be maintained in two places: the project that created the\n   tarball and the project settings of `gitlab-org/gitlab`.\n1.   We had no version control for the download script; this was just\n   text stored in the project's CI settings.\n1.   The download script was fragile. We had one case where we added an\n   `exit` statement in the wrong place, and all `gitlab-org/gitlab`\n   builds started silently using stale checkouts left behind by other\n   pipelines.\n1.   In case of a Google Cloud Storage outage, the full uncached traffic\n   would saturate the Gitaly server hosting `gitlab-org/gitlab`. Such\n   outages are rare but they do happen.\n1.   A user who wanted to copy our solution would have to set up\n   their own Google Cloud Storage bucket and pay the bills for it.\n\nThe biggest issue really was that one year on, the CI pre-clone script\nhad not evolved from a custom one-off solution into an easy-to-use\nfeature for everyone.\n\nWe solved this problem by building the pack-objects cache, which we\nwill describe in more detail in the next blog post. 
Unlike the CI pre-clone script,\nwhich was a separate component, the pack-objects cache sits inside\nGitaly. It is always on, for all repositories and all users on\nGitLab.com. If you run your own GitLab server you can also use the\npack-objects cache, but you do have to [turn it on\nfirst](https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#pack-objects-cache).\n\n## Performance comparison\n\nTo illustrate what has changed, we created a benchmark. We set up a GitLab\nserver with a clone of `gitlab-org/gitlab` on it, and we configured a\nclient machine to perform 20 simultaneous shallow clones of the same commit using Git HTTP.[^ssh] This\nsimulates having a CI pipeline with 20 parallel jobs. The pack data is\nabout 87MB, so in terms of bandwidth, we are transferring `20 * 87 =\n1740MB` of data.\n\n[^ssh]: As of GitLab 14.6, Git HTTP is 3x more CPU-efficient on the server than Git SSH. We are working on [improving the efficiency of Git SSH in GitLab](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/652). We prioritized optimizing Git HTTP because that is what GitLab CI uses.\n\nWe did this experiment with two GitLab servers. Both were Google\nCompute Engine `c2-standard-8` virtual machines with 8 CPU cores and\n32GB RAM. The operating system was Ubuntu 20.04 and we installed\nGitLab using our Omnibus packages.\n\n### Before\n\n- GitLab FOSS 13.7.9 (released December 2020)\n- Default Omnibus configuration\n\nThe 30-second [Perf flamegraph](https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html) below was captured at 99Hz across all CPUs.\n\n![Flamegraph of GitLab 13.7 performance](https://about.gitlab.com/images/blogimages/git-fetch-2021/before.jpg)\n\nSource: [SVG](/images/blogimages/git-fetch-2021/before.svg)\n\n### After\n\n- GitLab FOSS 14.6.1 (released December 2021)\n- One extra setting in `/etc/gitlab/gitlab.rb`:\n\n```ruby\ngitaly['pack_objects_cache_enabled'] = true\n```\n\n![Flamegraph of GitLab 14.6 performance with\ncache](https://about.gitlab.com/images/blogimages/git-fetch-2021/after.jpg)\n\nSource: [SVG](/images/blogimages/git-fetch-2021/after.svg)\n\n### Analysis\n\nServer CPU profile distribution:\n\n|Value|Before|After|\n|---|---|---|\n|Benchmark run time|27s|7.5s|\n|`git` profile samples|18 552|923|\n|`gitaly` samples (Git RPC server process)|1 247|331|\n|`gitaly-hooks` samples (pack-objects cache client)||258|\n|`gitlab-workhorse` samples (application HTTP frontend)|1 057|237|\n|`nginx` samples (main HTTP frontend)|474|251|\n|Total CPU busy samples|21 720|2 328|\n|CPU utilization during benchmark|100%|40%|\n\n### Conclusion\n\nCompared to GitLab 13.7 (December 2020), GitLab 14.6 (December 2021) plus the\npack-objects cache makes the CI fetch benchmark in this post run 3.6x faster.\nAverage server CPU utilization during the benchmark dropped from 100%\nto 40%.\n\nStay tuned for part 2 of this blog post, in which we will go over the\nchanges we made to make this happen.\n\n## Related content\n\n- [Gitaly pack-objects cache documentation](https://docs.gitlab.com/ee/administration/gitaly/configure_gitaly.html#pack-objects-cache)\n- [Epic to improve Git SSH efficiency in GitLab](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/652)\n",[757,864,9],{"slug":885,"featured":6,"template":688},"git-fetch-performance","content:en-us:blog:git-fetch-performance.yml","Git Fetch 
Performance","en-us/blog/git-fetch-performance.yml","en-us/blog/git-fetch-performance",{"_path":891,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":892,"content":898,"config":905,"_id":907,"_type":13,"title":908,"_source":15,"_file":909,"_stem":910,"_extension":18},"/en-us/blog/git-performance-on-nfs",{"title":893,"description":894,"ogTitle":893,"ogDescription":894,"noIndex":6,"ogImage":895,"ogUrl":896,"ogSiteName":672,"ogType":673,"canonicalUrls":896,"schema":897},"What we're doing to fix Gitaly NFS performance regressions","How we're improving our Git IO patterns to fix performance regressions when running Gitaly on NFS.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749670065/Blog/Hero%20Images/git-performance-nfs.jpg","https://about.gitlab.com/blog/git-performance-on-nfs","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"What we're doing to fix Gitaly NFS performance regressions\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"James Ramsay\"},{\"@type\":\"Person\",\"name\":\"Zeger-Jan van de Weg\"}],\n        \"datePublished\": \"2019-07-08\",\n      }",{"title":893,"description":894,"authors":899,"heroImage":895,"date":902,"body":903,"category":681,"tags":904},[900,901],"James Ramsay","Zeger-Jan van de Weg","2019-07-08","\nFrom the start, Gitaly, GitLab's service that is the interface to our Git data,\nfocused on removing the dependency on NFS. We achieved this task at the end\nof the summer 2018, when the [NFS drives were unmounted on GitLab.com][gitaly-nfs-blog].\nThe migration was geared towards improving the availability of Git data at\nGitLab and correctness, that is: fixing bugs. To an extent, performance\nwas an afterthought. By rewriting most of the RPCs in Go there were side effects\nthat positively improved performance, but conversely there were also occasions\nwhere performance wasn't addressed immediately, but rather added to the backlog\nfor the next iteration.\n\nSince releasing Gitaly 1.0, and updating GitLab to use Gitaly instead of Rugged\nfor all Git operations, we have observed severe performance regressions for\nlarge GitLab instances when using NFS. To address these performance problems in\nGitLab 11.9, we added [feature flags][feature-flag-docs] to enable\nRugged implementations that improve performance for affected GitLab instances.\nThese have been back ported to 11.5-11.8.\n\n### So what's the problem?\n\nWhile the migration was under way, there were noticeable performance regressions.\nIn most cases, these were so-called N + 1 access patterns. One example was the\n[pipeline index view](https://gitlab.com/gitlab-org/gitlab-ce/pipelines/), where\neach pipeline runs on a commit. On that page, GitLab used to call the `FindCommit`\nRPC for each pipeline. To improve performance, a new RPC was added;\n`ListCommitsByOid`. In which case, the object IDs for the commits were collected\nfirst, once request was made to Gitaly to get all the commits and return them to\ncontinue rendering the view.\n\nThis approach was, and still is, successful. However, detecting these N + 1\nqueries is hard. When GitLab is run for development as part of the GDK, or\nduring testing, a special N + 1 detector will raise an error if an N + 1\noccurred. This approach has several shortcomings, for one; most tests will only\ntest the behavior of one entity, not 20. This reduces the likelihood of the\nerror being raised. 
There is also a way to silence N + 1 errors, for example:\n\n```ruby\nproject = Project.find(1)\n\n# Allowed N + 1: one FindCommit RPC per pipeline\nGitalyClient.allow_n_plus_1 do\n  project.pipelines.last(20).each do |pipeline|\n    project.repository.find_commit(pipeline.sha)\n  end\nend\n\n# The better solution would be a single batched RPC\n\nshas = project.pipelines.last(20).map(&:sha)\nproject.repository.list_commits_by_oid(shas)\n```\n\nWhatever happened in that block would not be counted. For each of these blocks,\nissues were created and added to [an epic][epic-nplus1]; however, little\nprogress was made by the teams who had bypassed these checks in this way. This\nwas primarily because these performance issues were not a big\nproblem for GitLab.com, despite the fact they had become a problem for our customers.\n\nThe detected N + 1 issues included a lot of Git object read operations, for\nexample the `FindCommit` RPC. This is especially bad because this requires a\nnew Git process to be invoked to fetch each commit. If a millisecond later\nanother request comes in for the same repository, Gitaly will invoke Git again\nand Git will do all this work again. Before the migration and when GitLab.com\nwas still using NFS, GitLab leveraged Rugged, and used memoization to keep around\nthe Rugged Repository until the Rails request was done. This allowed Rugged to\nload part of the Git repository into memory for faster access for subsequent\nrequests. This property was not recreated in Gitaly for some time.\n\n## Enter cat-file cache\n\nIn GitLab 12.1, Gitaly will cache a repository per Rails session to recreate this\nbehavior with a feature called ['cat-file' cache](https://gitlab.com/gitlab-org/gitaly/merge_requests/1203).\nTo explain how this cache works and its name, it should be noted that objects\nin Git are compressed using [zlib][zlib]. This means that when a commit object\nisn't packed and can be located directly on disk, it seemingly contains garbage:\n\n```\n# This example is an empty .gitkeep file\n$ cat .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391\nxKOR0`\n```\n\nNow `cat-file` can query for the object and, with the `-p` flag, pretty-print\nit. The following example prints the current [Gitaly license][gitaly-mit].\n\n```\n$ git cat-file -p c7344c56da804e88a0bca979a53e1ec1c8b6021e\nThe MIT License (MIT)\n... omitted\n```\n\nCat-file has another flag, `--batch`, which allows for multiple objects to be\nrequested from the same process through STDIN.\n\n```\n$ git cat-file --batch\nc7344c56da804e88a0bca979a53e1ec1c8b6021e\nc7344c56da804e88a0bca979a53e1ec1c8b6021e blob 1083\nThe MIT License (MIT)\n\n... omitted\n```\n\nTracing the Git process using [strace][strace] allows us to inspect how Git\namortizes expensive operations to improve performance. The output on STDOUT and\nthe strace are available [as a snippet](https://gitlab.com/snippets/1858975).\n\nThe process is reading the first input from STDIN, or file descriptor 0, at\n[line 141](https://gitlab.com/snippets/1858975#L141). It starts writing the output\nabout [40 syscalls later](https://gitlab.com/snippets/1858975#L180). In between\nthere are two important operations performed: an\n[mmap of the pack file index](https://gitlab.com/snippets/1858975#L167), and\nanother [mmap of the pack file itself](https://gitlab.com/snippets/1858975#L177).\nThese operations store part of these files in memory, so that they are available\nthe next time they are needed.\n\nIn the snippet, we've requested the same blob on the same process again. 
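\n\nKeeping that `--batch` process alive between requests is the essence of the cache. Here is a minimal, hypothetical Ruby sketch of the idea; it is not Gitaly's actual implementation.\n\n```ruby\n# Hypothetical sketch of a cat-file process cache: one long-lived\n# `git cat-file --batch` process per repository, reused across requests,\n# so the mmaps of the pack index and pack file stay warm.\nclass CatFileCache\n  def initialize\n    @processes = {} # repository path => IO for `git cat-file --batch`\n  end\n\n  def object(repo_path, oid)\n    io = (@processes[repo_path] ||= spawn_batch(repo_path))\n    io.puts(oid)\n    header = io.gets # e.g. \"c7344c5... blob 1083\"\n    _oid, _type, size = header.split\n    body = io.read(Integer(size))\n    io.read(1) # discard the trailing newline after the object body\n    body\n  end\n\n  private\n\n  def spawn_batch(repo_path)\n    IO.popen([\"git\", \"-C\", repo_path, \"cat-file\", \"--batch\"], \"r+\")\n  end\nend\n\ncache = CatFileCache.new\n# The second lookup reuses the warm process instead of invoking Git again:\ncache.object(\"/path/to/repo.git\", \"c7344c56da804e88a0bca979a53e1ec1c8b6021e\")\ncache.object(\"/path/to/repo.git\", \"c7344c56da804e88a0bca979a53e1ec1c8b6021e\")\n```\n\n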
Requesting the same object twice is a synthetic follow-up request, but even if the next request had been for `HEAD`,\nGit would have to do considerably less work to come up with the object\nthat `HEAD` dereferences to.\n\nKeeping a cat-file process around for subsequent requests was shipped in\nGitLab 11.11 behind the `gitaly_catfile-cache` feature flag, and will be\n[enabled by default][remove-ff] in GitLab 12.1.\n\n### Next steps\n\nThe `cat-file` cache is one of many changes being made to improve Git IO\npatterns in GitLab, mitigating slow IO when using NFS and improving the performance\nof GitLab overall. In particular, progress was made in GitLab 11.11, and continues\nto be made, in eliminating the worst N + 1 access patterns from GitLab. You can\nfollow [gitlab-org&1190][epic-worst-io] for\nthe full plan and progress.\n\nThe Gitaly team's highest priority is\n[automatically enabling Rugged][automatic-rugged]\nfor GitLab servers using NFS to immediately mitigate the performance\nregressions until the performance improvements in GitLab and Gitaly are\nsufficiently complete, at which point Rugged can be removed again.\n\nIn the future, we will remove the need for NFS with\n[High Availability for Gitaly][ha-epic], providing both performance and\navailability, and eliminating the burden of maintaining an NFS cluster.\n\nCover image by [Jannes Glas](https://unsplash.com/@jannesglas) on [Unsplash](https://unsplash.com/photos/P6iOpqQpwwU)\n{: .note}\n\n[automatic-rugged]: https://gitlab.com/gitlab-org/gitlab-ce/issues/60931\n[epic-nplus1]: https://gitlab.com/groups/gitlab-org/-/epics/827\n[epic-worst-io]: https://gitlab.com/groups/gitlab-org/-/epics/1190\n[feature-flag-docs]: https://docs.gitlab.com/ee/administration/nfs.html#improving-nfs-performance-with-gitlab\n[gitaly-mit]: https://gitlab.com/gitlab-org/gitaly/blob/1b09f13374be5b272d40b3b089372adae2801f81/LICENSE\n[gitaly-nfs-blog]: /2018/09/12/the-road-to-gitaly-1-0/\n[ha-epic]: https://gitlab.com/groups/gitlab-org/-/epics/842\n[remove-ff]: https://gitlab.com/gitlab-org/gitaly/issues/1671\n[strace]: https://strace.io/\n[zlib]: https://www.zlib.net/\n",[757,9],{"slug":906,"featured":6,"template":688},"git-performance-on-nfs","content:en-us:blog:git-performance-on-nfs.yml","Git Performance On Nfs","en-us/blog/git-performance-on-nfs.yml","en-us/blog/git-performance-on-nfs",{"_path":912,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":913,"content":919,"config":926,"_id":928,"_type":13,"title":929,"_source":15,"_file":930,"_stem":931,"_extension":18},"/en-us/blog/gitlab-changes-to-cloudflare",{"title":914,"description":915,"ogTitle":914,"ogDescription":915,"noIndex":6,"ogImage":916,"ogUrl":917,"ogSiteName":672,"ogType":673,"canonicalUrls":917,"schema":918},"Why GitLab.com is changing its CDN provider to Cloudflare March 28","Get the scoop on our plan to change GitLab.com to Cloudflare.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749665811/Blog/Hero%20Images/daytime-clouds.jpg","https://about.gitlab.com/blog/gitlab-changes-to-cloudflare","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why GitLab.com is changing its CDN provider to Cloudflare March 28\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"David Smith\"}],\n        \"datePublished\": \"2020-01-16\",\n      }",{"title":914,"description":915,"authors":920,"heroImage":916,"date":922,"body":923,"category":681,"tags":924},[921],"David Smith","2020-01-16","\n\n## Upcoming changes to our CDN for GitLab.com\n\nAs 
GitLab.com has grown, so have our needs around the security and scalability of the web application. We are in the process of changing our CDN provider to [Cloudflare](https://www.cloudflare.com/) as part of our improvements to GitLab.com. We are approaching this change with care, and this post is to let everyone know about the shift ahead of time.\n\n## Update on timing\n\nWe have picked the weekend of March 28, 2020 to do the switch to Cloudflare. Recent incident work on GitLab.com led us to push back from March 21, the date we published last week.\n\n### Why are we working on this?\n\nWe are currently using [Fastly](https://www.fastly.com) for serving static content, but we want to improve GitLab.com availability, security, and performance with other tools like a Web Application Firewall (WAF), [Spectrum](https://www.cloudflare.com/products/cloudflare-spectrum/), and [Argo](https://www.cloudflare.com/products/argo-smart-routing/). We also want to preserve the current workflow for both `git` and web application interactions with GitLab.com. Since GitLab.com serves more than just HTTPS traffic, the change is a little more complicated. The traffic pattern requires a solution that can handle traffic on both port 22 and port 443. As a result of the complexity and requirements, we realized we would like to have a solution for CDN, WAF, and DDoS protection with one vendor.\n\nDuring the summer of 2019, we did evaluations and chose Cloudflare as the vendor that could best meet our requirements. Now that we are closer to switching over, we have created a [readiness review](https://gitlab.com/gitlab-com/gl-infra/readiness/tree/master/cloudflare) to talk about our plans for the changeover.\n\n### What you need to know\n\nFirst, this change will not affect self-managed users of GitLab; it only applies to users of GitLab.com. At a very high level, most users of GitLab.com will not need to take any action.\n\nGitLab.com users with a whitelist of sites in their firewall setup will need to change what is whitelisted for GitLab.com. For the initial change, we will be switching DNS to Cloudflare. This will cause all GitLab.com traffic to be proxied through Cloudflare. This change will be visible as changes in the DNS records returned for GitLab.com.\nA whitelist of IPs can be found [here](https://www.cloudflare.com/ips/).\nWe wanted to make sure this is communicated ahead of time, as it is an important detail for some firewall setups.\n\nSSH-based `git` actions via `altssh.gitlab.com` on port 443 continue to be supported. As with GitLab.com, any firewalls you set up might need to be reconfigured to the new IP ranges.\n\nCustom runner images or private runners could also be affected if they have any kind of caching of DNS or SSL certificates.\n\n### How can I stay up to date on when the change will happen?\n\nWe will update this blog post, [GitLab status](https://status.gitlab.com), and [@gitlabstatus on Twitter](https://www.twitter.com/gitlabstatus) with the planned date of this initial change – likely sometime in early February 2020. When it is time for the change on GitLab.com, we will also update [GitLab.com ranges](https://docs.gitlab.com/ee/user/gitlab_com/#ip-range) with the range from [Cloudflare](https://www.cloudflare.com/ips/).\n\nOnce we know traffic is flowing through Cloudflare successfully, we will start exploring more features like the WAF in logging-only mode.  
We will also test [Argo](https://www.cloudflare.com/products/argo-smart-routing/), and we hope to make traffic to GitLab.com even faster.\n\nFeel free to ask our support team your questions, and they will be able to talk to our infrastructure team for the details. Thanks for your continued support and check here for more updates soon!\n\n### Links to our plans and other information\n\n1. [GitLab status: Subscribe by email, Twitter, webhook, Slack](https://status.gitlab.com)\n2. [More discussion about this blog post](https://gitlab.com/gitlab-com/www-gitlab-com/issues/5907)\n3. [Production readiness review MR](https://gitlab.com/gitlab-com/gl-infra/readiness/tree/master/cloudflare)\n4. [Top-level epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/94)\n5. [Cloudflare privacy policy](https://www.cloudflare.com/privacypolicy/)\n6. [Cloudflare IP ranges](https://www.cloudflare.com/ips/)\n7. [Cloudflare Prometheus Exporter](https://gitlab.com/gitlab-org/cloudflare_exporter)\n\n\n### Definitions\n- Web Application Firewall (WAF): A type of firewall that helps protect web applications from a specific set of attacks\n- Argo: A Cloudflare product that helps route web traffic across the fastest and most reliable network paths\n- Spectrum: A Cloudflare product that helps secure the types of ports that GitLab.com uses for SSH access\n\nCover image by [Sam Schooler](https://unsplash.com/photos/E9aetBe2w40) on [Unsplash](https://unsplash.com/)\n{: .note}\n",[9,864,925],"security",{"slug":927,"featured":6,"template":688},"gitlab-changes-to-cloudflare","content:en-us:blog:gitlab-changes-to-cloudflare.yml","Gitlab Changes To Cloudflare","en-us/blog/gitlab-changes-to-cloudflare.yml","en-us/blog/gitlab-changes-to-cloudflare",{"_path":933,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":934,"content":940,"config":949,"_id":951,"_type":13,"title":952,"_source":15,"_file":953,"_stem":954,"_extension":18},"/en-us/blog/gitlab-com-stability-post-gcp-migration",{"title":935,"description":936,"ogTitle":935,"ogDescription":936,"noIndex":6,"ogImage":937,"ogUrl":938,"ogSiteName":672,"ogType":673,"canonicalUrls":938,"schema":939},"What's up with GitLab.com? Check out the latest data on its stability","Let's take a look at the data on the stability of GitLab.com from before and after our recent migration from Azure to GCP, and dive into why things are looking up.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749671280/Blog/Hero%20Images/gitlab-gke-integration-cover.png","https://about.gitlab.com/blog/gitlab-com-stability-post-gcp-migration","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"What's up with GitLab.com? Check out the latest data on its stability\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Andrew Newdigate\"}],\n        \"datePublished\": \"2018-10-11\",\n      }",{"title":935,"description":936,"authors":941,"heroImage":937,"date":943,"body":944,"category":681,"tags":945},[942],"Andrew Newdigate","2018-10-11","\nThis post is inspired by [this comment on Reddit](https://www.reddit.com/r/gitlab/comments/9f71nq/thanks_gitlab_team_for_improving_the_stability_of/),\nthanking us for improving the stability of GitLab.com. Thanks, hardwaresofton! 
Making GitLab.com\nready for your mission-critical workloads has been top of mind for us for some time, and it's\ngreat to hear that users are noticing a difference.\n\n_Please note that the numbers in this post differ slightly from the Reddit post as the data has changed since that post._\n\nWe will continue to work hard on improving the availability and stability of the platform. Our\ncurrent goal is to achieve 99.95 percent availability on GitLab.com – look out for an upcoming\npost about how we're planning to get there.\n\n## GitLab.com stability before and after the migration\n\nAccording to [Pingdom](http://stats.pingdom.com/81vpf8jyr1h9), GitLab.com's availability for the year to date, up until the migration was **[99.68 percent](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=527563485&range=F2)**, which equates to about 32 minutes of downtime per week on average.\n\nSince the migration, our availability has improved greatly, although we have much less data to compare with than in Azure.\n\n![Availability Chart](https://docs.google.com/spreadsheets/d/e/2PACX-1vQg_tdtdZYoC870W3u2R2icSK0Rd9qoOtDJqYHALaQlzhxXOmfY63X1NMMyFVEypQs7NngR4UUIZx5R/pubchart?oid=458170195&format=image)\n\nUsing data publicly available from Pingdom, here are some stats about our availability for the year to date:\n\n| Period                                 | Mean-time between outage events |\n| -------------------------------------- | ------------------------------- |\n| Pre-migration (Azure)                  | **1.3 days**                    |\n| Post-migration (GCP)                   | **7.3 days**                    |\n| Post-migration (GCP) excluding 1st day | **12 days**                     |\n\nThis is great news: we're experiencing outages less frequently. What does this mean for our availability, and are we on track to achieve our goal of 99.95 percent?\n\n| Period                    | Availability                                                                                                                   | Downtime per week |\n| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- |\n| Pre-migration (Azure)     | **[99.68%](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=527563485&range=F2)**  | **32 minutes**    |\n| Post-migration (GCP)      | **[99.88 %](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=527563485&range=B3)** | **13 minutes**    |\n| Target – not yet achieved | **99.95%**                                                                                                                     | **5 minutes**     |\n\nDropping from 32 minutes per week average downtime to 13 minutes per week means we've experienced a **61 percent improvement** in our availability following our migration to Google Cloud Platform.\n\n## Performance\n\nWhat about the performance of GitLab.com since the migration?\n\nPerformance can be tricky to measure. In particular, averages are a terrible way of measuring performance, since they neglect outlying values. One of the better ways to measure performance is with a latency histogram chart. 
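\n\nAs a rough illustration of the difference (bucket boundaries invented for the example), a histogram preserves the shape of the distribution, including the slow tail that an average hides:\n\n```ruby\n# Rough illustration: bucket request durations (seconds) into ascending\n# bins. Unlike an average, the histogram keeps the slow tail visible.\nBUCKETS = [0.01, 0.05, 0.1, 0.5, 1, 5, 10].freeze # invented boundaries\n\ndef histogram(durations)\n  counts = Hash.new(0)\n  durations.each do |d|\n    bucket = BUCKETS.find { |b| b >= d } || Float::INFINITY\n    counts[bucket] += 1\n  end\n  counts\nend\n\ndurations = [0.02, 0.03, 0.04, 0.06, 0.07, 0.09, 8.0] # one slow outlier\np histogram(durations)\n# => {0.05=>3, 0.1=>3, 10=>1}  the lone 8-second request stays visible\np durations.sum / durations.size\n# => ~1.19  the average hides the shape entirely\n```\n\n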
To build these histograms, we imported the GitLab.com access logs for July (for Azure) and September (for Google Cloud Platform) into [Google BigQuery](https://cloud.google.com/bigquery/), then selected the 100 most popular endpoints for each month and categorized these as either API, web, git, long-polling, or static endpoints. Comparing these histograms side-by-side allows us to study how the performance of GitLab.com has changed since the migration.\n\n![GitLab.com Latency Histogram](https://about.gitlab.com/images/blogimages/whats-up-with-gitlab-com/azure_v_gcp_latencies.gif)\n\nIn this histogram, higher values on the left indicate better performance. The right of the graph is the \"_tail_\", and the \"_fatter the tail_\", the worse the user experience.\n\nThis graph shows us that with the move to GCP, more requests are completing within a satisfactory amount of time.\n\nHere are two more graphs showing the difference for API and Git requests respectively.\n\n![API Latency Histogram](https://about.gitlab.com/images/blogimages/whats-up-with-gitlab-com/api-performance-histogram.png)\n\n![Git Latency Histogram](https://about.gitlab.com/images/blogimages/whats-up-with-gitlab-com/git-performance-histogram.png)\n\n## Why these improvements?\n\nWe chose Google Cloud Platform because we believe that Google offers the most reliable cloud platform for our workload, particularly as we move towards running GitLab.com in [Kubernetes](/solutions/kubernetes/).\n\nHowever, there are many other reasons unrelated to our change in cloud provider for these improvements to stability and performance.\n\n> #### _“We chose Google Cloud Platform because we believe that Google offers the most reliable cloud platform for our workload”_\n\nLike any large SaaS site, GitLab.com is a large, complicated system, and attributing availability improvements to individual changes is extremely difficult, but here are a few factors which may be affecting our availability and performance:\n\n### Reason #1: Our Gitaly Fleet on GCP is much more powerful than before\n\nGitaly is responsible for all Git access in the GitLab application. Before Gitaly, Git access occurred directly from within Rails workers. Because of the scale we run at, we require many servers serving the web application, and therefore, in order to share Git data between all workers, we relied on NFS volumes. Unfortunately this approach doesn't scale well, which led us to build Gitaly, a dedicated Git service.\n\n> #### _“We've opted to give our fleet of 24 Gitaly servers a serious upgrade”_\n\n#### Our upgraded Gitaly fleet\n\nAs part of the migration, we've opted to give our fleet of 24 [Gitaly](/blog/the-road-to-gitaly-1-0/) servers a serious upgrade. If the old fleet was the equivalent of a nice family sedan, the new fleet is like a pack of snarling muscle cars, ready to serve your Git objects.\n\n| Environment | Processor                       | Number of cores per instance | RAM per instance |\n| ----------- | ------------------------------- | ---------------------------- | ---------------- |\n| Azure       | Intel Xeon Ivy Bridge @ 2.40GHz | 8                            | 55GB             |\n| GCP         | Intel Xeon Haswell @ 2.30GHz    | **32**                       | **118GB**        |\n\nOur new Gitaly fleet is much more powerful. 
This means that Gitaly can respond to requests more quickly, and deal better with unexpected traffic surges.\n\n#### IO performance\n\nAs you can probably imagine, serving [225TB of Git data](https://dashboards.gitlab.com/d/ZwfWfY2iz/vanity-metrics-dashboard?orgId=1) to roughly half-a-million active users a week is a fairly IO-heavy operation. Any performance improvements we can make to this will have a big impact on the overall performance of GitLab.com.\n\nFor this reason, we've focused on improving performance here too.\n\n| Environment | RAID         | Volumes | Media    | Filesystem | Performance                                                            |\n| ----------- | ------------ | ------- | -------- | ---------- | ---------------------------------------------------------------------- |\n| Azure       | RAID 5 (lvm) | 16      | magnetic | xfs        | 5k IOPS, 200MB/s (_per disk_) / 32k IOPS, **1280MB/s** (_volume group_) |\n| GCP         | No RAID      | 1       | **SSD**  | ext4       | **60k read IOPS**, 30k write IOPS, 800MB/s read, 200MB/s write          |\n\nHow does this translate into real-world performance? Here are average read and write times across our Gitaly fleet:\n\n##### IO performance is much higher\n\nHere are some comparative figures for our Gitaly fleet from Azure and GCP. In each case, the performance in GCP is much better than in Azure, although this is what we would expect given the more powerful fleet.\n\n[![Disk read time graph](https://docs.google.com/spreadsheets/d/e/2PACX-1vQg_tdtdZYoC870W3u2R2icSK0Rd9qoOtDJqYHALaQlzhxXOmfY63X1NMMyFVEypQs7NngR4UUIZx5R/pubchart?oid=458168633&format=image)](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=1002437172) [![Disk write time graph](https://docs.google.com/spreadsheets/d/e/2PACX-1vQg_tdtdZYoC870W3u2R2icSK0Rd9qoOtDJqYHALaQlzhxXOmfY63X1NMMyFVEypQs7NngR4UUIZx5R/pubchart?oid=884528549&format=image)](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=1002437172) [![Disk Queue length graph](https://docs.google.com/spreadsheets/d/e/2PACX-1vQg_tdtdZYoC870W3u2R2icSK0Rd9qoOtDJqYHALaQlzhxXOmfY63X1NMMyFVEypQs7NngR4UUIZx5R/pubchart?oid=2135164979&format=image)](https://docs.google.com/spreadsheets/d/1uJ_zacNvJTsvJUfNpi1D_aPBg-vNJC1xJzsSwGKKt8g/edit#gid=1002437172)\n\nNote: For reference, for Azure, this uses the average times for the week leading up to the failover. For GCP, it's an average for the week up to October 2, 2018.\n\nThese stats clearly illustrate that our new fleet has far better IO performance than our old cluster. Gitaly performance is highly dependent on IO performance, so this is great news and goes a long way to explaining the performance improvements we're seeing.\n\n### Reason #2: Fewer \"unicorn worker saturation\" errors\n\n![HTTP 503 Status GitLab](https://about.gitlab.com/images/blogimages/whats-up-with-gitlab-com/facepalm-503.png)\n\nUnicorn worker saturation sounds like it'd be a good thing, but it's really not!\n\nWe ([currently](https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/1899)) rely on [unicorn](https://bogomips.org/unicorn/), a Ruby/Rack HTTP server, for serving much of the application. Unicorn uses a single-threaded model, with a fixed pool of worker processes. Each worker can handle only one request at a time. 
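\n\nSome back-of-the-envelope arithmetic shows why this model is fragile; the numbers below are illustrative, not our real fleet size.\n\n```ruby\n# Illustrative capacity math for a fixed pool of single-threaded workers\n# (numbers invented, not GitLab.com's actual fleet size).\nworkers = 600\n\nfast_request = 0.05 # seconds: a healthy Gitaly-backed request\nslow_request = 5.0  # seconds: the same request during a load problem\n\nputs workers / fast_request # => 12000.0 requests per second of capacity\nputs workers / slow_request # => 120.0, a 100x collapse from one slow backend\n```\n\n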
If the worker gives no response within 60 seconds, it is terminated and another process is spawned to replace it.\n\n> #### _“Unicorn worker saturation sounds like it'd be a good thing, but it's really not!”_\n\nAdd to this the lack of autoscaling technologies to ramp the fleet up when we experience high load volumes, and this means that GitLab.com has a relatively static-sized pool of workers to handle incoming requests.\n\nIf a Gitaly server experiences load problems, even fast [RPCs](https://en.wikipedia.org/wiki/Remote_procedure_call) that would normally only take milliseconds could take up to several seconds to respond – thousands of times slower than usual. Requests to the unicorn fleet that communicate with the slow server will take hundreds of times longer than expected. Eventually, most of the fleet is handling requests to that affected backend server. This leads to a queue which affects all incoming traffic, a bit like a tailback on a busy highway caused by a traffic jam on a single offramp.\n\nIf the request gets queued for too long – after about 60 seconds – the request will be cancelled, leading to a 503 error. This is indiscriminate – all requests, whether they interact with the affected server or not, will get cancelled. This is what I call unicorn worker saturation, and it's a very bad thing.\n\nBetween February and August this year we frequently experienced this phenomenon.\n\nThere are several approaches we've taken to dealing with this:\n\n- **Fail fast with aggressive timeouts and circuit breakers**: Timeouts mean that when a Gitaly request is expected to take a few milliseconds, it times out after a second, rather than waiting for the request to time out after 60 seconds. While some requests will still be affected, the cluster will remain generally healthy. Gitaly currently doesn't use circuit breakers, but we plan to add this, possibly using [Istio](https://istio.io/docs/tasks/traffic-management/circuit-breaking/) once we've moved to Kubernetes.\n\n- **Better abuse detection and limits**: More often than not, server load spikes are driven by users going against our fair usage policies. We built tools to better detect this, and over the past few months an abuse team has been established to deal with this. Sometimes, load is driven through huge repositories, and we're working on reinstating fair-usage limits which prevent 100GB Git repositories from affecting our entire fleet.\n\n- **Concurrency controls and rate limits**: To limit the blast radius, rate limiters (mostly in HAProxy) and concurrency limiters (in Gitaly) slow overzealous users down to protect the fleet as a whole.\n\n### Reason #3: GitLab.com no longer uses NFS for any Git access\n\nIn early September we disabled Git NFS mounts across our worker fleet. This was possible because Gitaly had reached v1.0: the point at which it's sufficiently complete. You can read more about how we got to this stage in our [Road to Gitaly blog post](/blog/the-road-to-gitaly-1-0/).\n\n### Reason #4: Migration as a chance to reduce debt\n\nThe migration was a fantastic opportunity for us to improve our infrastructure, simplify some components, and otherwise make GitLab.com more stable and more observable. For example, we've rolled out new **structured logging infrastructure**.\n\nAs part of the migration, we took the opportunity to move much of our logging across to structured logs. 
### Reason #3: GitLab.com no longer uses NFS for any Git access\n\nIn early September we disabled Git NFS mounts across our worker fleet. This was possible because Gitaly had reached v1.0: the point at which it's sufficiently complete. You can read more about how we got to this stage in our [Road to Gitaly blog post](/blog/the-road-to-gitaly-1-0/).\n\n### Reason #4: Migration as a chance to reduce debt\n\nThe migration was a fantastic opportunity for us to improve our infrastructure, simplify some components, and otherwise make GitLab.com more stable and more observable. For example, we've rolled out new **structured logging infrastructure**.\n\nAs part of the migration, we took the opportunity to move much of our logging across to structured logs. We use [fluentd](https://www.fluentd.org/), [Google Pub/Sub](https://cloud.google.com/pubsub/docs/overview), and [Pubsubbeat](https://github.com/GoogleCloudPlatform/pubsubbeat), and store our logs in [Elastic Cloud](https://www.elastic.co/cloud) and [Google Stackdriver Logging](https://cloud.google.com/logging/). Having reliable, indexed logs has allowed us to reduce our mean time to detection of incidents and, in particular, to detect abuse. This new logging infrastructure has also been invaluable in detecting and resolving several security incidents.\n\n> #### _“This new logging infrastructure has also been invaluable in detecting and resolving several security incidents”_\n\nWe've also focused on making our staging environment much more similar to our production environment. This allows us to test more changes, more accurately, in staging before rolling them out to production. Previously, the team maintained a limited, scaled-down staging environment, and many changes were not adequately tested before being rolled out. Our environments now share a common configuration and we're working to automate all [terraform](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5079) and [chef](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5078) rollouts.\n\n### Reason #5: Process changes\n\nUnfortunately many of the worst outages we've experienced over the past few years have been self-inflicted. We've always been transparent about these — and will continue to be so — but as we rapidly grow, it's important that our processes scale alongside our systems and team.\n\n> #### _“It's important that our processes scale alongside our systems and team”_\n\nIn order to address this, over the past few months, we've formalized our change and incident management processes. These processes respectively help us to avoid outages and to resolve them more quickly when they do occur.\n\nIf you're interested in finding out more about the approach we've taken to these two vital disciplines, they're published in our handbook:\n\n- [GitLab.com's Change Management Process](/handbook/engineering/infrastructure/change-management/)\n- [GitLab.com's Incident Management Process](/handbook/engineering/infrastructure/incident-management/)\n\n### Reason #6: Application improvement\n\nEvery GitLab release includes [performance and stability improvements](https://gitlab.com/gitlab-org/gitlab-ce/issues?scope=all&state=opened&label_name%5B%5D=performance); some of these have had a big impact on GitLab's stability and performance, particularly fixes for n+1 issues.\n\nTake Gitaly for example: like other distributed systems, Gitaly can suffer from a class of performance degradations known as \"n+1\" problems. This happens when an endpoint needs to make many queries (_\"n\"_) to fulfill a single request.\n\n> Consider an imaginary endpoint which queried Gitaly for all tags on a repository, and then issued an additional query for each tag to obtain more information. This would result in n + 1 Gitaly queries: one for the initial list of tags, and then one for each of the n tags. This approach would work fine for a project with 10 tags – issuing 11 requests – but for a project with 1000 tags it would result in 1001 Gitaly calls, each with its own round-trip time, issued in sequence.\n\n
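In code, the difference between the n+1 shape and a batched RPC looks roughly like this (a sketch with hypothetical `listTags`/`tagDetails`/`listTagsWithDetails` stand-ins, not Gitaly's real API):\n\n```go\npackage main\n\nimport \"fmt\"\n\ntype tagDetail struct{ Name, Message string }\n\n// Hypothetical stand-ins for Gitaly RPCs; each call is one network round trip.\nfunc listTags(repo string) []string { return []string{\"v1.0\", \"v1.1\"} }\nfunc tagDetails(repo, tag string) tagDetail {\n    return tagDetail{Name: tag, Message: \"release \" + tag}\n}\nfunc listTagsWithDetails(repo string) []tagDetail {\n    return []tagDetail{{\"v1.0\", \"release v1.0\"}, {\"v1.1\", \"release v1.1\"}}\n}\n\nfunc main() {\n    // n+1 shape: one RPC for the tag list, then one more RPC per tag.\n    for _, tag := range listTags(\"my/repo\") {\n        fmt.Println(tagDetails(\"my/repo\", tag))\n    }\n\n    // Batched shape: a single RPC returns the tags with their details,\n    // so the round-trip cost no longer grows with the number of tags.\n    for _, d := range listTagsWithDetails(\"my/repo\") {\n        fmt.Println(d.Name, d.Message)\n    }\n}\n```\n\n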
![Latency drop in Gitaly endpoints](https://about.gitlab.com/images/blogimages/whats-up-with-gitlab-com/drop-off.png)\n\nUsing data from Pingdom, this chart shows long-term performance trends since the start of the year. It's clear that latency improved a great deal on May 7, 2018. This date happens to coincide with the RC1 release of GitLab 10.8, and its deployment on GitLab.com.\n\nIt turns out that this was due to a [single n+1 fix on the merge request page](https://gitlab.com/gitlab-org/gitlab-ce/issues/44052).\n\nWhen running in development or test mode, GitLab now detects n+1 situations, and we have compiled [a list of known n+1s](https://gitlab.com/gitlab-org/gitlab-ce/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=performance&label_name[]=Gitaly&label_name[]=technical%20debt). As these are resolved, we expect even more performance improvements.\n\n![GitLab Summit - South Africa - 2018](https://about.gitlab.com/images/summits/2018_south-africa_team.jpg)\n\n### Reason #7: Infrastructure team growth and reorganization\n\nAt the start of May 2018, the Infrastructure team responsible for GitLab.com consisted of five engineers.\n\nSince then, we've had a new director, two new managers, a specialist [Postgres DBRE](https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/13778), and four new [SREs](https://handbook.gitlab.com/job-families/engineering/infrastructure/site-reliability-engineer/) join the Infrastructure team. The database team has been reorganized to be an embedded part of the Infrastructure group. We've also brought in [Ongres](https://www.ongres.com/), a specialist Postgres consultancy, to work alongside the team.\n\nHaving enough people on the team has allowed us to split time between on-call, tactical improvements, and longer-term strategic work.\n\nOh, and we're still hiring! If you're interested, check out [our open positions](/jobs/) and choose the Infrastructure Team 😀\n\n## TL;DR: Conclusion\n\n1. GitLab.com is more stable: availability has improved 61 percent since we migrated to GCP\n1. GitLab.com is faster: latency has improved since the migration\n1. 
We are totally focused on continuing these improvements, and we're building a great team to do it\n\nOne last thing: our Grafana dashboards are open, so if you're interested in digging into our metrics in more detail, visit [dashboards.gitlab.com](https://dashboards.gitlab.com) and explore!\n",[946,756,754,947,948,9],"GKE","kubernetes","news",{"slug":950,"featured":6,"template":688},"gitlab-com-stability-post-gcp-migration","content:en-us:blog:gitlab-com-stability-post-gcp-migration.yml","Gitlab Com Stability Post Gcp Migration","en-us/blog/gitlab-com-stability-post-gcp-migration.yml","en-us/blog/gitlab-com-stability-post-gcp-migration",{"_path":956,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":957,"content":963,"config":969,"_id":971,"_type":13,"title":972,"_source":15,"_file":973,"_stem":974,"_extension":18},"/en-us/blog/gitlab-importers",{"title":958,"description":959,"ogTitle":958,"ogDescription":959,"noIndex":6,"ogImage":960,"ogUrl":961,"ogSiteName":672,"ogType":673,"canonicalUrls":961,"schema":962},"How to migrate data to GitLab using main importers","Learn about the capabilities of main importers, which are used to import data from external tools and from other GitLab instances.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749679170/Blog/Hero%20Images/migration-data.jpg","https://about.gitlab.com/blog/gitlab-importers","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How to migrate data to GitLab using main importers\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Itzik Gan Baruch\"}],\n        \"datePublished\": \"2023-02-13\",\n      }",{"title":958,"description":959,"authors":964,"heroImage":960,"date":966,"body":967,"category":681,"tags":968},[965],"Itzik Gan Baruch","2023-02-13","\n\nA typical organization looking to adopt GitLab already uses many other tools. Artifacts such as code, build pipelines, issues, and epics will already exist and be changed daily. A seamless transition of work in progress is, therefore, critically important when importing data. GitLab importers aim to make this process easy and reliable, ensuring data is imported quickly and with maximum care.\n\nAt GitLab, a dedicated development team, named group:import, creates a seamless experience when importing data into GitLab or from one GitLab instance to another. This team continuously develops and improves the importing experience and keeps our importers up to date with new features and capabilities.\n\n## Migrate groups by direct transfer\n\nUsing group migration, you can import groups from one GitLab instance to another. The most common use case is to import groups from self-managed GitLab instances to GitLab.com (GitLab SaaS). With group migration, you can migrate many groups in a single click.\n\n### Which items are imported?\n\nThe group migration imports the entire group structure, including all the subgroups and projects in them. Currently, to import projects as part of the group migration on self-managed GitLab, the administrator needs to enable the feature flag named `bulk_import_projects`. On GitLab.com, our SaaS offering, migration of both groups and projects is available. More information can be found in our [documentation](https://docs.gitlab.com/ee/user/group/import/#migrate-groups-by-direct-transfer-recommended).\n\nThe team continuously adds objects to the migration, but not all group items are imported. 
The docs cover the [items that are imported](https://docs.gitlab.com/ee/user/group/import/#migrated-group-items). \n\n### How can groups be imported?\n\nIt is very simple to import groups between two instances. Here are the steps: \n\n- Create a new group or subgroup in the designated instance \n- Select \"Import group\" \n- Connect to the remote instance with your [personal access token](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html)\n- Select the source groups you want to import \n- Click \"Import xyz groups\"\n\n![bulk_imports_v14_1](https://about.gitlab.com/images/blogimages/2022-11-15-gitlab-importers/bulk_imports_v14_1.png)\n\n## File-based import/export (the previously used method)\n\nGroup migration is the preferred method to migrate content from one GitLab instance to another, as it automates the process and you can import many groups in a single click. However, group migration requires a network connection between the two instances, so it won't help in some use cases, such as air-gapped networks or environments with limited connectivity. File-based export/import for [groups](https://docs.gitlab.com/ee/user/group/settings/import_export.html) and [projects](https://docs.gitlab.com/ee/user/project/settings/import_export.html) can be used when there is no connectivity between the instances. \n\nFile-based export/import is a manual process and requires a few steps to migrate each group or project. File-based import/export is available from both the UI and the API. The team plans to disable it behind a feature flag soon to encourage users to use group migration. However, you will be able to enable the feature flag in your instance if your use case requires file-based import/export. More info can be found in this [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/363406).\n\n## Import projects from external tools  \n\nGitLab has built-in support for importing projects from [a variety of tools](https://docs.gitlab.com/ee/user/project/import/).\n\nThe GitHub importer is the most common importer and, therefore, the team invests a lot of effort in adding more migrated components. GitLab and GitHub have different structures and architectures, so it is sometimes tricky to import objects from GitHub when the migrated components are implemented differently in GitLab, and the team needs to find creative ways to map some features and configurations. This is an example [epic](https://gitlab.com/groups/gitlab-org/-/epics/8585) with a proposal to map rules for protected branches when migrating GitHub protected rules. \n\n\n### What can be imported from GitHub to GitLab?\n\n- Repository description\n- Git repository data\n- Branch protection rules\n- Issues\n- Pull requests\n- Wiki pages\n- Milestones\n- Labels\n- Pull request review comments\n- Regular issue and pull request comments\n- Attachments for\n    - Release notes\n    - Comments and notes\n    - Issue description\n    - Merge Request description\n- Git Large File Storage (LFS) objects\n- Pull request reviews \n- Pull request “merged by” information \n- Pull request comment replies in discussions \n- Diff note suggestions \n- Release note descriptions\n\nHere is a [full list of imported data](https://docs.gitlab.com/ee/user/project/import/github.html#imported-data).\n\nRead what's next in our [GitHub Epic](https://gitlab.com/groups/gitlab-org/-/epics/2984). 
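\n\nIf you'd rather script the GitHub import than click through the UI, GitLab also exposes it through the REST Import API. Here's a rough sketch in Go; the host, tokens, and repository ID are placeholders, and the exact parameters should be verified against the GitHub importer docs linked above:\n\n```go\npackage main\n\nimport (\n    \"log\"\n    \"net/http\"\n    \"net/url\"\n    \"strings\"\n)\n\nfunc main() {\n    // Kick off a GitHub import via POST /api/v4/import/github.\n    form := url.Values{\n        \"personal_access_token\": {\"GITHUB_TOKEN\"}, // GitHub PAT\n        \"repo_id\":               {\"12345\"},        // GitHub repository ID\n        \"target_namespace\":      {\"my-group\"},     // where the project lands\n    }\n    req, _ := http.NewRequest(\"POST\",\n        \"https://gitlab.example.com/api/v4/import/github\",\n        strings.NewReader(form.Encode()))\n    req.Header.Set(\"PRIVATE-TOKEN\", \"GITLAB_TOKEN\") // GitLab PAT\n    req.Header.Set(\"Content-Type\", \"application/x-www-form-urlencoded\")\n\n    resp, err := http.DefaultClient.Do(req)\n    if err != nil {\n        log.Fatal(err)\n    }\n    defer resp.Body.Close()\n    log.Println(\"status:\", resp.Status)\n}\n```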
 \n\n### Repository by URL\n\nAn alternative way to import external projects is the Repository by URL option. You can import any Git repository over HTTP from the *Import Project* page by choosing \"Repository by URL\".\n\nTo learn more about the Importer direction, roadmap, etc., refer to [Category Direction - Importers](/direction/manage/import_and_integrate/importers/).\n\n_Cover image by [Conny Schneider](https://unsplash.com/@choys_?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyTex) on [Unsplash](https://unsplash.com/s/photos/data-migration?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)_\n",[755,707,9],{"slug":970,"featured":6,"template":688},"gitlab-importers","content:en-us:blog:gitlab-importers.yml","Gitlab Importers","en-us/blog/gitlab-importers.yml","en-us/blog/gitlab-importers",{"_path":976,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":977,"content":982,"config":988,"_id":990,"_type":13,"title":991,"_source":15,"_file":992,"_stem":993,"_extension":18},"/en-us/blog/gitlab-incident-management",{"title":978,"description":979,"ogTitle":978,"ogDescription":979,"noIndex":6,"ogImage":876,"ogUrl":980,"ogSiteName":672,"ogType":673,"canonicalUrls":980,"schema":981},"Downtime happens, but GitLab Incident Management can help","GitLab's DevOps Platform doesn't just make it easy to release safe software faster, it also streamlines the process for problem solving. Here's a deep dive into GitLab Incident Management.","https://about.gitlab.com/blog/gitlab-incident-management","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Downtime happens, but GitLab Incident Management can help\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Itzik Gan Baruch\"}],\n        \"datePublished\": \"2021-11-30\",\n      }",{"title":978,"description":979,"authors":983,"heroImage":876,"date":984,"body":985,"category":730,"tags":986},[965],"2021-11-30","\n\nDowntime is expensive and the cost is growing. Software reliability is as important as the product itself – it doesn't matter what your product can do if your customers can't reliably access it. GitLab's Incident Management is built into our [DevOps Platform](/solutions/devops-platform/) and empowers teams with adaptable practices and a streamlined workflow for triaging and resolving incidents. We offer tools that provide access to observability resources, such as metrics, logs, errors, runbooks, and traces, that foster easy collaboration across response teams, and that support continuous improvement via post-incident reviews and system recommendations. Here's a look at how it all works.\n\n## The costs of being down\n\nDowntime can cost companies hundreds of thousands of dollars in a single hour. Avoiding downtime is critical for organizations. Companies need to invest time, establish processes and a culture around managing outages, and be able to resolve them quickly. The larger an organization becomes, the more distributed its systems become. This distribution leads to longer response times and more money lost. Investing in the right tools and fostering a culture of autonomy, feedback, quality, and automation leads to more time spent innovating and building software. If done well, teams will spend less time reacting to outages and racing to restore services. The tools your [DevOps](/topics/devops/) teams use to respond during incidents also have a huge effect on MTTR (Mean Time To Resolve, also known as Mean Time To Repair). 
 \n\n## What is an incident? \n\nIncidents are anomalous conditions that result in — or may lead to — service degradation or outages. Those outages can impact employee productivity, and decrease customer satisfaction and trust. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given attention and resolved.\n\n## What is Incident Management? \n\nIncident Management is a process focused on restoring services as quickly as possible and proactively addressing early vulnerabilities and warnings, all while keeping employees productive and customers happy. \n\n## Meet GitLab Incident Management \n\n[GitLab Incident Management](https://docs.gitlab.com/ee/operations/incident_management/) aims to decrease the overhead of managing incidents so response teams can spend more time actually resolving problems. We accelerate problem resolution through efficient knowledge sharing in the same tool teams already use to collaborate on development. Enabling teams to quickly gather resources in one central, aggregated view gives the team a single source of truth and shortens the MTTR. \n\nGitLab’s built-in Incident Management solution provides tools for the triage, response, and remediation of incidents. It enables developers to easily triage and view the alerts and incidents generated by their application. By surfacing alerts and incidents _where the code is being developed_, problems can be resolved more efficiently. \n\n## Why Incident Management within GitLab?\n\nGitLab is a [DevOps Platform](/solutions/devops-platform/), delivered as a single application. As such, we believe there are additional benefits for DevOps users to manage incidents within GitLab.\n\n1. Co-location of code, CI/CD, monitoring tools, and incidents reduces context switching and enables GitLab to correlate what would otherwise be disparate events or processes within a single control plane.\n\n2. The same interface for development collaboration and incident response streamlines the process. The developers who are on-call can use the same interface they already use every day; this prevents incident responders from having to use a tool they are unfamiliar with, which would hamper their ability to respond to the incident.\n\n## How to manage incidents in the GitLab DevOps Platform\n\n### Create an incident manually or automatically \n\nYou can create incidents manually or enable GitLab to create incidents automatically whenever an alert is triggered. If you use PagerDuty for incidents, you can [set up a webhook with PagerDuty](https://docs.gitlab.com/ee/operations/incident_management/incidents.html#create-incidents-via-the-pagerduty-webhook) to automatically create a GitLab incident for each PagerDuty incident. \n\n![pd](https://about.gitlab.com/images/blogimages/incident-mgmt/pager.png)\n\n### Alert Management \n\n[Alerts](https://docs.gitlab.com/ee/operations/incident_management/alerts.html) are a critical entity in the incident management workflow. They represent a notable event that might indicate a service outage or disruption. GitLab can accept alerts from any source via a webhook receiver. GitLab provides a list view for triage and a detail view for deeper investigation of what happened.\n\n![alert](https://about.gitlab.com/images/blogimages/incident-mgmt/alert.png)\n\n
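Because the receiver is just an HTTP endpoint, pushing an alert from your own tooling is a small amount of code. Here's a rough sketch in Go; the URL and authorization key come from your project's alert integration settings, and the payload field names should be checked against the alerts documentation linked above:\n\n```go\npackage main\n\nimport (\n    \"bytes\"\n    \"encoding/json\"\n    \"log\"\n    \"net/http\"\n)\n\nfunc main() {\n    // Post a custom alert to a project's alert webhook receiver.\n    payload, _ := json.Marshal(map[string]string{\n        \"title\":           \"Latency SLO burn on web-07\",\n        \"description\":     \"p99 response time above threshold for 10 minutes\",\n        \"severity\":        \"high\",\n        \"monitoring_tool\": \"custom-prober\",\n    })\n    req, _ := http.NewRequest(\"POST\",\n        \"https://gitlab.example.com/my-group/my-project/alerts/notify.json\",\n        bytes.NewReader(payload))\n    req.Header.Set(\"Authorization\", \"Bearer AUTHORIZATION_KEY\")\n    req.Header.Set(\"Content-Type\", \"application/json\")\n\n    resp, err := http.DefaultClient.Do(req)\n    if err != nil {\n        log.Fatal(err)\n    }\n    defer resp.Body.Close()\n    log.Println(\"status:\", resp.Status)\n}\n```\n\n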
### On-Call Schedules\n\nTo maintain the availability of your software services, you need to schedule on-call teams. [On-call schedule management](https://docs.gitlab.com/ee/operations/incident_management/oncall_schedules.html) is used to create schedules for responders to rotate on-call responsibilities. Within each schedule you can add team members to rotations that last hours, days, or weeks depending on your team's needs. Some teams need to be on-call just during business hours, while others have someone on-call 24/7, 365; every team is different.  \n\n![on-call](https://about.gitlab.com/images/blogimages/incident-mgmt/on-call.png)\n\n### Escalation Policies\n\n[Escalation Policies](https://docs.gitlab.com/ee/operations/incident_management/escalation_policies.html) determine when on-call users are notified and what happens if they don’t respond. They are the if/then logic that uses on-call schedules to make sure teams never miss an incident. You can create an escalation policy in the GitLab project where you manage on-call schedules.\n\n![escalation](https://about.gitlab.com/images/blogimages/incident-mgmt/escalation.png) \n\n### Paging and Notifications \n\nWhen there is a new alert or incident, it is important for a responder to be notified immediately so they can triage and respond to the problem. GitLab Incident Management supports email notifications, with plans to add Slack notifications, SMS, and phone calls. \n\n\n\n\n\n\n",[707,987,9],"collaboration",{"slug":989,"featured":6,"template":688},"gitlab-incident-management","content:en-us:blog:gitlab-incident-management.yml","Gitlab Incident Management","en-us/blog/gitlab-incident-management.yml","en-us/blog/gitlab-incident-management",{"_path":995,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":996,"content":1002,"config":1007,"_id":1009,"_type":13,"title":1010,"_source":15,"_file":1011,"_stem":1012,"_extension":18},"/en-us/blog/gitlab-value-stream-analytics",{"title":997,"description":998,"ogTitle":997,"ogDescription":998,"noIndex":6,"ogImage":999,"ogUrl":1000,"ogSiteName":672,"ogType":673,"canonicalUrls":1000,"schema":1001},"The role of Value Stream Analytics in GitLab's DevOps Platform","Better DevOps teams start with value stream management. Here's how to get the most out of GitLab's Value Stream Analytics.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668041/Blog/Hero%20Images/Understand-Highly-Technical-Spaces.jpg","https://about.gitlab.com/blog/gitlab-value-stream-analytics","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The role of Value Stream Analytics in GitLab's DevOps Platform\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Itzik Gan Baruch\"}],\n        \"datePublished\": \"2022-01-24\",\n      }",{"title":997,"description":998,"authors":1003,"heroImage":999,"date":1004,"body":1005,"category":730,"tags":1006},[965],"2022-01-24","\n\n***\"Whenever there is a product for a customer, there is a value stream. The challenge lies in seeing it!\"*** *Learning to See - Shook & Rother*\n\nEvery company today is a software company, so the level of innovation and delivery has a direct impact on revenue generation. In order to be successful, businesses must deliver an amazing digital experience, keep up with the latest technologies, deliver value at the speed demanded by customers, and do it all with zero tolerance for outages or security breaches. That's where value stream management comes into play.\n\n*“If you can’t describe what you are doing as a value stream, you don’t know what you’re doing.”* *(Martin, K. 
& Osterling, M. (2014). Value Stream Mapping. McGraw-Hill, p. 15.)*\n\nValue stream management (VSM) is a change in development mindset that puts the customer at the center. VSM allows teams to measure and improve the software delivery and value flow to customers. The development process is outlined from ideation until customer value realization. The focus is no longer on features and functionality – instead, organizations ensure the efforts and resources invested to deliver value to customers will improve flows that are causing bottlenecks, optimizing the cycle and shortening time to market. \n\nYou can learn more about [Value Stream Mapping](/topics/devops/value-stream-mapping/) here.\n\n## An overview of GitLab's Value Stream Analytics \n\nAs part of [GitLab's DevOps Platform](/solutions/devops-platform/), Value Stream Analytics provides one shared view of the team's velocity. With insights into how long it takes the team to move from planning to monitoring, it's possible to pinpoint areas for improvement. Value Stream Analytics measures the time spent for each project or group. It displays the median time spent in each stage of the process by measuring from its start event to its end event. It helps identify bottlenecks in the development process, enabling management to uncover, triage, and identify the root cause of slowdowns in the software development life cycle and to quickly act on them to improve efficiency.\n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_1.png)\n\n## Why are Value Stream Analytics important? \n\nThe process of efficient software delivery starts by understanding where the slowest parts are and what the root causes behind them are. With this information it's possible to build a plan for optimization.  \n\n## Which DevOps stages are tracked? \n\nThe stages tracked by Value Stream Analytics by default represent GitLab's DevOps Platform flow - \n**Issue**, **Plan**, **Code**, **Test**, **Review** and **Staging**.  \n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_stages.png)\n\n## How to customize GitLab's Value Stream Analytics \n\nNote: The stages can be customized in group-level Value Stream Analytics; currently no customization is available at the project level. \n\nClick Edit in Value Stream Analytics \n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_4.png)\n\nClick Add another stage \n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_5.png)\n\nDefine a stage name, and select a start event and an end event from the list. \n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_6.png)\n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_7.png)\n\n## The key metrics \n\nThe dashboard includes useful key metrics which help you understand team performance. If, for example, the values of **new issues**, **commits** and **deploys** are high, it's clear a team is productive. The dashboard also surfaces the DevOps metrics commonly known as the **DORA (DevOps Research and Assessment) 4**. The [DORA 4 metrics](https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance) show the value the team delivered to customers.\n\n**Deployment Frequency** shows how often code is deployed to production and brings value to end users. **Lead time for changes** measures how long it takes a change to get into production. 
Like deployment frequency, this metric measures team velocity.\n\n![vsa](https://about.gitlab.com/images/blogimages/vsa/vsa_metrics.png)\n\n## The importance of Value Stream Analytics within GitLab\n\nGitLab is a complete DevOps Platform, delivered as a single application. As such, teams use the same application during the development process from planning to monitoring. One of the benefits of being a single application for the entire DevOps lifecycle is that the data flows from all DevOps stages and is available for analysis, so Value Stream Analytics correlates and identifies how teams are spending their time without the need to integrate with an external tool. \n\nLearn more about [Value Stream Analytics for projects](https://docs.gitlab.com/ee/user/analytics/value_stream_analytics.html) and [Value Stream Analytics for groups](https://docs.gitlab.com/ee/user/group/value_stream_analytics/).\n\nTake a deeper dive into what DORA calls [elite DevOps teams](/blog/how-to-make-your-devops-team-elite-performers/).\n\n\n\n\n\n\n\n\n\n\n",[707,9,732],{"slug":1008,"featured":6,"template":688},"gitlab-value-stream-analytics","content:en-us:blog:gitlab-value-stream-analytics.yml","Gitlab Value Stream Analytics","en-us/blog/gitlab-value-stream-analytics.yml","en-us/blog/gitlab-value-stream-analytics",{"_path":1014,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1015,"content":1020,"config":1026,"_id":1028,"_type":13,"title":1029,"_source":15,"_file":1030,"_stem":1031,"_extension":18},"/en-us/blog/gitlab-value-stream-management-and-dora",{"title":1016,"description":1017,"ogTitle":1016,"ogDescription":1017,"noIndex":6,"ogImage":722,"ogUrl":1018,"ogSiteName":672,"ogType":673,"canonicalUrls":1018,"schema":1019},"Improving visibility: GitLab's value stream and DORA metrics","Optimize DevOps with the new DORA metrics in GitLab Value Stream Management.","https://about.gitlab.com/blog/gitlab-value-stream-management-and-dora","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Break the black box of software delivery with GitLab Value Stream Management and DORA Metrics\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2022-06-20\",\n      }",{"title":1021,"description":1017,"authors":1022,"heroImage":722,"date":1023,"body":1024,"category":730,"tags":1025},"Break the black box of software delivery with GitLab Value Stream Management and DORA Metrics",[818],"2022-06-20","\n\nOur customers frequently tell us that despite being very effective DevOps practitioners, they still struggle to build a data-driven DevOps culture. They find it especially hard to answer the fundamental question:\n\n_What are the right things to measure?_\n\nThis becomes more challenging in enterprise organizations where there are hundreds of different development groups and no normalization between how things are done or measured. Because of this, we see a strong interest from customers in metrics that would allow them to standardize between teams and benchmark themselves against the industry.\n\n![Value Stream Analytics helps you visualize and manage the DevOps flow from ideation to customer delivery.](https://about.gitlab.com/images/blogimages/2022-06-dora-vsa-overview.png){: .shadow}\nValue Stream Analytics helps you visualize and manage the DevOps flow from ideation to customer delivery.\n{: .note.text-center}\n\n## What Are DORA Metrics? 
\n\nWith the continued acceleration of digital transformation, most organizations realize that technology delivery excellence is a must for long-term success and competitive advantage. After seven years of data collection and research, the [DORA's State of DevOps research program](https://www.devops-research.com/research.html) has developed and validated four metrics that measure software delivery performance: [(1) deployment frequency, (2) lead time for changes, (3) time to restore service and (4) change failure rate.](https://docs.gitlab.com/ee/user/analytics/#devops-research-and-assessment-dora-key-metrics) \n\nIn GitLab, The One DevOps Platform, [Value Stream Analytics (VSA)](/solutions/value-stream-management/) surfaces a single source of insight for each stage of the software development process. The analytics are available out of the box for teams to drive performance improvements.\n\n## What does DORA bring to Value Stream Analytics?\n\nValue Stream Analytics (VSA) measures [the entire journey from customer request to release](https://docs.gitlab.com/ee/user/group/value_stream_analytics/) and automatically displays the overall performance of the stream. Each stage in the value stream is transparent and compliant in a shared experience for everyone in the company. \n\nThis makes the VSA the single source of truth (SSoT) about what's happening within the entire software supply chain, with DORA’s metrics as the key measure of the value stream outputs. \n\n## How do Value Stream Analytics work?\n\nValue stream analytics measures the median time spent by issues or merge requests in each development stage.\n\nAs an example, a stage might begin with the addition of a label to an issue and end with the addition of another label:\n\n![Value stream analytics measures each stage from its start event to its end event.](https://about.gitlab.com/images/blogimages/2022-06-dora-vsa-stage.png){: .shadow}\nValue stream analytics measures each stage from its start event to its end event.\n{: .note.text-center}\n\nFor each stage, a table list displays the workflow items filtered in the context of that stage. [In stages based on labels](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#label-based-stages-for-custom-value-streams), the table will list Issues, and in stages based on Commits, it will list MRs:\n\n![The VSA MR table provides a deeper insight into stage time breakdown .](https://about.gitlab.com/images/blogimages/2022-06-dora-vsa-mr.png){: .shadow}\nThe VSA MR table provides a deeper insight into stage time breakdown.\n{: .note.text-center}\n\nThe tables provide a deep dive into the stage performance and allow users to answer questions such as:\n\n- How to easily see bottlenecks that are slowing down the delivery of value to customers?\n- How to reduce the time spent in each stage so I can deliver features faster and stay competitive? \n- How can we develop code faster?\n- How can we hand off to QA faster?  How can we push changes to Production more quickly?\n\nUsing the Filter results text box, you can filter by a project (example below) or parameter (e.g., Milestone, Label). \n\n![Value stream analytics filtering.](https://about.gitlab.com/images/blogimages/2022-06-dora-vsa-filter.png){: .shadow}\nValue stream analytics filtering.\n{: .note.text-center}\n\nNo login is required to view [Value stream analytics for projects](https://gitlab.com/gitlab-org/gitlab/-/value_stream_analytics) where you can become familiar with stream filtering, default stages and deep-dive tables. 
For a full view of the DORA metrics, you have to log in with your GitLab [Ultimate-tier](https://about.gitlab.com/pricing/) account or sign up for a [free trial](https://about.gitlab.com/free-trial/).\n\n## How to understand DevOps maturity and benchmark progress with the DORA metrics?\n\nDORA metrics can also provide answers to questions not related to VSA, such as:\n\n- How to become an elite team of DevOps professionals?\n- How do I perform vs. industry standards? \n- Is the organization better at DevOps this year than last?\n\n## Learn more about VSA and DORA:\n\n- Check out the GitLab Speed Run about DORA metrics in VSA:\n\u003Ciframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/wQU-mWvNSiI\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen>\u003C/iframe>\n\n- [GitLab DORA metrics API documentation](https://docs.gitlab.com/ee/api/dora/metrics.html)\n\n- [Step-by-step instructions for creating a custom value stream](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#create-a-value-stream-with-gitlab-default-stages)\n",[843,707,823,9,732],{"slug":1027,"featured":6,"template":688},"gitlab-value-stream-management-and-dora","content:en-us:blog:gitlab-value-stream-management-and-dora.yml","Gitlab Value Stream Management And Dora","en-us/blog/gitlab-value-stream-management-and-dora.yml","en-us/blog/gitlab-value-stream-management-and-dora",{"_path":1033,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1034,"content":1040,"config":1046,"_id":1048,"_type":13,"title":1049,"_source":15,"_file":1050,"_stem":1051,"_extension":18},"/en-us/blog/gitlabs-maven-dependency-proxy-is-available-in-beta",{"title":1035,"description":1036,"ogTitle":1035,"ogDescription":1036,"noIndex":6,"ogImage":1037,"ogUrl":1038,"ogSiteName":672,"ogType":673,"canonicalUrls":1038,"schema":1039},"GitLab's Maven dependency proxy is available in Beta","Enterprises can use new package registry feature to consolidate artifact management on GitLab, increasing the efficiency and speed of CI/CD pipelines.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749663908/Blog/Hero%20Images/2023-devsecops-report-blog-banner2.png","https://about.gitlab.com/blog/gitlabs-maven-dependency-proxy-is-available-in-beta","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"GitLab's Maven dependency proxy is available in Beta\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Tim Rizzi\"}],\n        \"datePublished\": \"2023-12-11\",\n      }",{"title":1035,"description":1036,"authors":1041,"heroImage":1037,"date":1043,"body":1044,"category":948,"tags":1045},[1042],"Tim Rizzi","2023-12-11","GitLab is introducing the Maven dependency proxy, a new feature that will enable enterprises to consolidate on the DevSecOps platform for artifact management. The Maven dependency proxy, [available in Beta](https://gitlab.com/groups/gitlab-org/-/epics/3610), enables larger organizations to be more efficient by expanding the functionality of GitLab's package registry. The new feature can make pipelines faster and more reliable, and can reduce the cost of data transfer since over time most packages will be pulled from the cache.\n\n## How the Maven dependency proxy works\n\nA typical software project relies on a variety of dependencies, which we call packages. Packages can be internally built and maintained, or sourced from a public repository. 
Based on our user research, we’ve learned that most projects use a 50/50 mix of public vs. private packages. When installing packages, the order in which they are found and downloaded is very important, as downloading or using an incorrect package or version of a package can introduce breaking changes and security vulnerabilities into a pipeline.\n\nThe Maven dependency proxy gives users the ability to add or configure one external Java repository. Once added, when a user tries to install a Java package using their project-level endpoint, GitLab will first look for the package in the project and, if it's not found, will attempt to pull the package from the external repository.\n\nWhen a package is pulled from the external repository, it will be imported into the GitLab project so that the next time that particular package/version is pulled, it's pulled from GitLab and not the external repository. If the external repository is having connectivity issues and the package is present in the dependency proxy, then pulling that package will still work. This will make your pipelines faster and more reliable.\n\nIf the package changes in the external repository — for example, a user deletes a version and publishes a new one with different files — the dependency proxy will detect that and invalidate the package in GitLab to pull the \"newer\" one. This will ensure that the correct packages are downloaded and help to reduce security vulnerabilities.\n\nIf the package is not found in the GitLab project or the external repository, GitLab will return an error.\n\n
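The lookup order described above amounts to a small piece of pull-through-cache logic. Here's a rough sketch of that flow in Go (hypothetical names and in-memory stores, not GitLab's implementation):\n\n```go\npackage main\n\nimport (\n    \"errors\"\n    \"fmt\"\n)\n\n// fetchFunc stands in for looking a package up in one backing store.\ntype fetchFunc func(coord string) (string, error)\n\nvar errNotFound = errors.New(\"package not found\")\n\n// resolve checks the GitLab project first, then the external repository,\n// caching upstream hits so future pulls are served from GitLab.\nfunc resolve(project, upstream fetchFunc, cache map[string]string, coord string) (string, error) {\n    if pkg, err := project(coord); err == nil {\n        return pkg, nil // already in the project: no upstream traffic\n    }\n    pkg, err := upstream(coord)\n    if err != nil {\n        return \"\", errNotFound // in neither place: an error is returned\n    }\n    cache[coord] = pkg // imported into the project for future pulls\n    return pkg, nil\n}\n\nfunc main() {\n    cache := map[string]string{}\n    project := func(c string) (string, error) {\n        if pkg, ok := cache[c]; ok {\n            return pkg, nil\n        }\n        return \"\", errNotFound\n    }\n    upstream := func(c string) (string, error) { return \"jar-bytes-for-\" + c, nil }\n\n    // The first pull comes from upstream and is cached; the second is a hit.\n    fmt.Println(resolve(project, upstream, cache, \"org.example:lib:1.0\"))\n    fmt.Println(resolve(project, upstream, cache, \"org.example:lib:1.0\"))\n}\n```\n\n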
Here are more details of the Maven dependency proxy:\n- This feature and all future dependency proxy formats will be in the Premium tier.\n- Project owners will be able to configure this feature via a project's settings (API or UI).\n- We will support external repositories that require authentication, such as Artifactory or Sonatype.\n\n## A fit for the enterprise\n\nEnterprise organizations that need to consolidate on GitLab and move away from Artifactory or Sonatype can make use of the new Maven dependency proxy. Virtual registries allow you to publish, proxy, and cache multiple package repositories behind a single, logical URL. \n\nThe Maven dependency proxy is the MVC of a set of features that will help enterprise organizations sunset their existing artifact management vendors, such as Artifactory or Sonatype Nexus, to help reduce costs and improve the developer user experience.\n\n#### Roadmap\n- [Finish the Maven dependency proxy](https://gitlab.com/groups/gitlab-org/-/epics/3610) (Milestone 16.7)\n- [npm dependency proxy](https://gitlab.com/groups/gitlab-org/-/epics/3608) \n- [Make the dependency proxy for containers work generically with any container registry](https://gitlab.com/groups/gitlab-org/-/epics/6061)\n- [PyPI dependency proxy](https://gitlab.com/groups/gitlab-org/-/epics/3612)\n- [NuGet dependency proxy](https://gitlab.com/groups/gitlab-org/-/epics/3611)\n\n## How we will measure success\n\nWe will start to measure success by tracking adoption by tier with the following metrics:\n\n- Number of packages pulled through the dependency proxy\n- The hit ratio (packages pulled from the cache vs. the upstream repository)\n- Number of users that pulled a package through the dependency proxy\n\n## How to get started\n\nIn the video below, you can see a short demo of the Maven dependency proxy in action.\n\n\u003C!-- blank line -->\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube.com/embed/9NPTXObsSrE?si=MFWg5C9j5a97LBeE\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\u003C!-- blank line -->\n\n### Prerequisites\n\n- As of this writing, the feature is behind a feature flag.\n- The settings for your project must be updated using [GraphQL](https://gitlab.com/-/graphql-explorer).\n\n> Join the Beta program by adding a comment to [this epic](https://gitlab.com/groups/gitlab-org/-/epics/3610). Note: The feature is planned to go to general availability in version 16.7 or 16.8.\n",[948,108,9,758],{"slug":1047,"featured":90,"template":688},"gitlabs-maven-dependency-proxy-is-available-in-beta","content:en-us:blog:gitlabs-maven-dependency-proxy-is-available-in-beta.yml","Gitlabs Maven Dependency Proxy Is Available In Beta","en-us/blog/gitlabs-maven-dependency-proxy-is-available-in-beta.yml","en-us/blog/gitlabs-maven-dependency-proxy-is-available-in-beta",{"_path":1053,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1054,"content":1060,"config":1066,"_id":1068,"_type":13,"title":1069,"_source":15,"_file":1070,"_stem":1071,"_extension":18},"/en-us/blog/gitlabs-next-generation-container-registry-is-now-available",{"title":1055,"description":1056,"ogTitle":1055,"ogDescription":1056,"noIndex":6,"ogImage":1057,"ogUrl":1058,"ogSiteName":672,"ogType":673,"canonicalUrls":1058,"schema":1059},"GitLab's next-generation container registry is now available","Self-managed customers can upgrade to the container registry (Beta) and unlock online garbage collection, which can reduce costly downtime and storage.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749683098/Blog/Hero%20Images/container-cloud__1_.png","https://about.gitlab.com/blog/gitlabs-next-generation-container-registry-is-now-available","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"GitLab's next-generation container registry is now available\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Tim Rizzi\"}],\n        \"datePublished\": \"2023-12-04\",\n      }",{"title":1055,"description":1056,"authors":1061,"heroImage":1057,"date":1062,"body":1063,"category":948,"tags":1064},[1042],"2023-12-04","**TLDR; Upgrade to the new container registry (Beta) to unlock online garbage collection. This [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/423459) has all the information you need to get started.**\n\nWhen I joined the GitLab Package stage, the [container registry](https://docs.gitlab.com/ee/user/packages/container_registry/) already existed and was a critical feature for GitLab and GitLab's customers. But some fundamental problems needed to be addressed.\n\n- The user interface was unusable due to missing functionality like sorting, filtering, and deleting container images.\n- Operations that required listing the tags associated with an image were not performant at scale.\n- There was no good way to delete container images programmatically.\n- We had very little insight into user adoption.\n- The storage costs for GitLab.com were tremendously high.\n\nOf course, all of the above issues were related. 
The container registry was using a fork of the [Distribution](https://github.com/distribution/distribution) project, which had a lot of performance and usability issues when operating at GitLab.com's scale.\n\nAs a team, we decided that the first problem to tackle was the ever-growing cost of storage for GitLab.com. The legacy registry did not support online garbage collection. After calculating that it would take an absurd amount of downtime to run garbage collection in offline mode, we moved on to our next idea: optimize the existing [offline garbage collector](https://gitlab.com/groups/gitlab-org/-/epics/2552).\n\n## Optimizing the container registry code\n\nWe optimized the code for Google Cloud Storage (GCS) and Amazon S3, and saw a 90% reduction in the time it takes to run garbage collection. This benefited many GitLab customers with container registries smaller than 100 TB. Even with the performance improvements, we estimated a staggering 64 days to run garbage collection for GitLab.com.\n\nIn the end, we took the Distribution project as far as we could. We needed a container registry that supported more advanced use cases than push and pull. And we needed to drastically reduce the operating costs to make the feature sustainable for Free tier users. We decided to [fork the Distribution project](https://gitlab.com/groups/gitlab-org/-/epics/2552) and build the next-generation container registry.\n\n## Solving the online garbage collection problem\n\nNext, we dove head first into solving the [online garbage collection](https://gitlab.com/groups/gitlab-org/-/epics/2313) problem for GitLab.com. Faced with petabytes of scale and the requirement to maintain our error budgets, we designed and implemented an [online migration of GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/5523) with zero degradation in service.\n\nWe completed the migration 12 months ago. The results?\n\n- Garbage collection deletes terabytes of data from GitLab.com each day.\n- Improved performance and reliability.\n- We removed a lot of data from object storage and saved a lot of money.\n\n## Migrating to the next-generation container registry\n\nNow we want to help GitLab self-managed customers migrate to the next-generation container registry. By upgrading, you will unlock support for online garbage collection, which can spare you costly downtime and escalating storage costs. You can also expect to see performance and reliability improvements for the container registry API and UI.\n\nAnother benefit is that you get to give early feedback to the team on what's working well or not so well for you. This feedback is valuable for GitLab and your organization because we will ensure that the next set of features being developed meets your needs.\n\n## The road ahead\n\nNew features are coming. Now that the registry leverages a metadata database for efficient queries, we can deliver significant UI and UX improvements that were impossible before. 
In 2024, we plan to add support for the features below.\n\n- [Making the container registry GA for self-managed customers](https://gitlab.com/groups/gitlab-org/-/epics/5521)\n- [Improved sorting and filtering with the container registry](https://gitlab.com/groups/gitlab-org/-/epics/8507)\n- [Improved UI for manifest/multi-arch container images](https://gitlab.com/groups/gitlab-org/-/epics/11952)\n- [Improved UI for container image attestation and signing](https://gitlab.com/groups/gitlab-org/-/epics/7856)\n- [Improved UI for storing Helm charts in the registry](https://gitlab.com/gitlab-org/gitlab/-/issues/38047)\n- Add support for [protected repositories](https://gitlab.com/groups/gitlab-org/-/epics/9825) and [immutable tags](https://gitlab.com/gitlab-org/container-registry/-/issues/82)\n\n**Note:** While the registry is in `Beta` for self-managed, we will be adding new features to GitLab.com that will not be immediately available to self-managed until the registry is generally available. This is to ensure that we focus on migrating as many customers as possible, as efficiently as possible.\n\n## Get started today\n\nWe want to enable those features for self-managed customers, but we need your help. Please consider migrating to the next-generation container registry today. The best place to start is the [feedback issue](https://gitlab.com/gitlab-org/gitlab/-/issues/423459), which has links to documentation, helpful tips, and the attention of the Package team here at GitLab.\n\n_Disclaimer: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab._",[1065,9,781],"careers",{"slug":1067,"featured":90,"template":688},"gitlabs-next-generation-container-registry-is-now-available","content:en-us:blog:gitlabs-next-generation-container-registry-is-now-available.yml","Gitlabs Next Generation Container Registry Is Now Available","en-us/blog/gitlabs-next-generation-container-registry-is-now-available.yml","en-us/blog/gitlabs-next-generation-container-registry-is-now-available",{"_path":1073,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1074,"content":1080,"config":1085,"_id":1087,"_type":13,"title":1088,"_source":15,"_file":1089,"_stem":1090,"_extension":18},"/en-us/blog/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x",{"title":1075,"description":1076,"ogTitle":1075,"ogDescription":1076,"noIndex":6,"ogImage":1077,"ogUrl":1078,"ogSiteName":672,"ogType":673,"canonicalUrls":1078,"schema":1079},"How a fix in Go 1.9 sped up our Gitaly service by 30x","After noticing a worrying pattern in Gitaly's performance, we uncovered an issue with fork locking tied to virtual memory size. 
Here's how we figured out the problem and how to fix it.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749666775/Blog/Hero%20Images/cover.jpg","https://about.gitlab.com/blog/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How a fix in Go 1.9 sped up our Gitaly service by 30x\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Andrew Newdigate\"}],\n        \"datePublished\": \"2018-01-23\",\n      }",{"title":1075,"description":1076,"authors":1081,"heroImage":1077,"date":1082,"body":1083,"category":681,"tags":1084},[942],"2018-01-23","\n\n[Gitaly](https://gitlab.com/gitlab-org/gitaly) is a Git RPC service that we are currently rolling out\nacross GitLab.com, to replace our legacy NFS-based file-sharing solution. We expect it to be faster, more stable,\nand the basis for amazing new features in the future.\n\nWe're still in the process of porting Git operations to Gitaly, but the service has been\nrunning in production on GitLab.com for about nine months, and currently peaks at about 1,000\n[gRPC](https://grpc.io/) requests per second. We expect the migration effort to be completed\nby the beginning of April, at which point all Git operations in the GitLab application will\nuse the service and we'll be able to decommission NFS infrastructure.\n\n\u003C!-- more -->\n\n## Worrying performance improvements\n\nThe first time we realized that something might be wrong was shortly after we'd finished deploying a new release.\n\nWe were monitoring the performance of one of the gRPC endpoints for the Gitaly service and noticed that the\n99th percentile performance of the endpoint had dropped from 400ms down to 100ms.\n\n![400ms to 100ms latency drop](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-01.png){: .shadow.center}\nLatencies drop from 400ms to 100ms after a deploy, for no good reason\n{: .note .text-center}\n\nThis should have been fantastic news, but it wasn't. There were no changes that should have led to faster\nresponse times. We hadn't optimized anything in that release; we hadn't changed the runtime and the new\nrelease was using the same version of Git.\n\nEverything _should have_ been exactly the same.\n\nWe started digging into the data a little more and quickly realized that 400ms is a very high latency for\nan operation that simply confirms the existence of a [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References).\n\nHow long had it been this way? Well, it started about 24 hours after the previous deployment.\n\n![100ms to 400ms latency hike](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-02.png){: .shadow.center}\nLatencies rising over a 24 hour period following a deployment, for no good reason\n{: .note .text-center}\n\nWhen browsing our Prometheus performance data, it quickly became apparent that this pattern was being repeated with each\ndeployment: things would start fast and gradually slow down. This was occurring across all endpoints. It had been this way for a while.\n\nThe first assumption was that there was some sort of resource leak in the application, causing the host to slow\ndown over time. Unfortunately, the data didn't back this up. 
CPU usage of the Gitaly service did increase, but the\nhosts still had lots of capacity.\n\n![Gitaly CPU charts](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-03.png){: .shadow.center}\nGitaly CPU increasing with process age, but not enough to explain the problem\n{: .note .text-center}\n\nAt this point, we still didn't have any good leads as to the cause of the problem, so we decided to further\nimprove the observability of the application by adding [pprof profiling support](https://golang.org/pkg/net/http/pprof/)\nand [cAdvisor](https://github.com/google/cadvisor) metrics.\n\n## Profiling\n\nAdding pprof support to a Go process is [very easy](https://gitlab.com/gitlab-org/gitaly/merge_requests/442).\nThe process already has a Prometheus listener, and we added a pprof handler on the same listener.\n\nSince production teams would need to be able to perform the profiling without our assistance, we\nalso [added a runbook](https://gitlab.com/gitlab-com/runbooks/blob/master/howto/gitaly-profiling.md).\n\nGo's pprof support is easy to use, and in our testing we found that the overhead it\nadded to production workloads was negligible, meaning we could use it in production without concern\nabout the impact it would have on site performance.\n\n
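For reference, the whole change amounts to a blank import and a shared listener. A minimal sketch of that shape (the port and wiring here are illustrative, not Gitaly's actual code):\n\n```go\npackage main\n\nimport (\n    \"log\"\n    \"net/http\"\n    _ \"net/http/pprof\" // registers /debug/pprof/* handlers on the default mux\n\n    \"github.com/prometheus/client_golang/prometheus/promhttp\"\n)\n\nfunc main() {\n    // Serve Prometheus metrics and pprof endpoints from one listener.\n    http.Handle(\"/metrics\", promhttp.Handler())\n    log.Fatal(http.ListenAndServe(\":9236\", nil))\n}\n```\n\nWith that in place, `go tool pprof http://localhost:9236/debug/pprof/profile` grabs a CPU profile from the running process.\n\n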
## cAdvisor\n\nThe Gitaly service spawns Git child processes for many of its endpoints. Unfortunately these Git\nchild processes don't have the same instrumentation as the parent process, so it was\ndifficult to tell if they were contributing to the problem. (Note: we record [`getrlimit(2)`](http://man7.org/linux/man-pages/man2/getrlimit.2.html) metrics for Git processes but cannot observe grandchild processes spawned by Git, which often do much of the heavy lifting)\n\nOn GitLab.com, Gitaly is managed through systemd, which will automatically create a cgroup for\neach service it manages.\n\nThis means that Gitaly and its child processes are contained within a single cgroup, which we\ncould monitor with [cAdvisor](https://github.com/google/cadvisor), a Google monitoring tool\nwhich supports cgroups and is compatible with Prometheus.\n\nAlthough we didn't have direct metrics to determine the behavior of the Git processes, we could\ninfer it using the cgroup metrics and the Gitaly process metrics: the difference between the\ntwo would tell us the resources (CPU, memory, etc) being consumed by the Git child processes.\n\nAt our request, the production team [added cAdvisor to the Gitaly servers](https://gitlab.com/gitlab-com/infrastructure/issues/3307).\n\nHaving cAdvisor gives us the ability to know what the Gitaly service, including all its child\nprocesses, is doing.\n\n![cAdvisor graphs for the Gitaly cgroup](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-04.png){: .shadow.center}\ncAdvisor graphs of the Gitaly cgroup\n{: .note .text-center}\n\n## From bad to worse. Much, much worse...\n\nIn the meantime, **[the situation had got far worse](https://gitlab.com/gitlab-org/gitaly/issues/823)**.\nInstead of only seeing gradual latency increases over time, we were now seeing far more serious lockups.\n\nIndividual Gitaly server instances would grind to a halt, to the point where all new incoming TCP connections\nwere not being accepted. This proved to be a problem for using pprof: during the lockup the connection\nwould time out when attempting to profile the process. Since the reason we added pprof was to observe the\nprocess under duress, that approach was a bust.\n\nInterestingly, during a lockup, CPU would actually decrease – the system was not overloaded, but rather\n _idled_. IOPS, iowait and CPU would all drop way down.\n\nEventually, after a few minutes, the service would recover and there would be a surge in backlogged\nrequests. Usually though, as soon as the state was detected, the production team would restart the\nservice manually.\n\nThe team spent a significant amount of time trying to recreate the problem locally, with little success.\n\n## Forking locks\n\nWithout pprof, we fell back to [SIGABRT thread dumps](http://pro-tips-dot-com.tumblr.com/post/47677612115/kill-a-hung-go-process-and-print-stack-traces)\nof hung processes. Using these, we determined that the process had a large amount of contention around [`syscall.ForkLock`](https://gitlab.com/gitlab-org/gitaly/issues/823#note_50951140)\nduring the lockups. In one dump, 1,400 goroutines were blocked waiting on `ForkLock` – most for several minutes.\n\n`syscall.ForkLock` has [the following documentation](https://github.com/golang/go/blob/release-branch.go1.8/src/syscall/exec_unix.go#L17):\n\n> Lock synchronizing creation of new file descriptors with fork.\n\nEach Gitaly server instance was `fork/exec`'ing Git processes about 20 times per second, so we finally seemed to have a very promising lead.\n\n## Serendipity\n\n[Researching ForkLock](https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/9365#note_54342481) led us to an issue on the Go repository,\nopened in 2013, about switching from `fork/exec` to [`clone(2)`](https://man7.org/linux/man-pages/man2/clone.2.html) with `CLONE_VFORK` and `CLONE_VM`\non systems that support it: [golang/go#5838](https://github.com/golang/go/issues/5838)\n\nThe `clone(2)` syscall with `CLONE_VFORK` and `CLONE_VM` is the same as\nthe [`posix_spawn(3)`](http://man7.org/linux/man-pages/man3/posix_spawn.3.html) C function, but the latter is easier to\nrefer to, so let's use that.\n\nWhen using `fork`, the child process will start with a copy of the parent process's memory.\nUnfortunately, this takes longer the larger the parent's virtual memory footprint.\nEven with copy-on-write, it can take several hundred milliseconds in a memory-intensive process.\n`posix_spawn` doesn't copy the parent process's memory space and takes roughly constant time.\n\nSome good benchmarks of `fork/exec` vs. `posix_spawn` can be found here: [https://github.com/rtomayko/posix-spawn#benchmarks](https://github.com/rtomayko/posix-spawn#benchmarks)\n\nThis seemed like a possible explanation. Over time, the virtual memory size of the Gitaly process would increase. As it\nincreased, each [`fork(2)`](http://man7.org/linux/man-pages/man2/fork.2.html) syscall would take longer. As fork latency increased, `syscall.ForkLock` contention would increase.\nIf `fork` calls arrived faster than they completed, the system could temporarily lock up entirely.\n\n(Interestingly, [`TCPListener.Accept`](https://golang.org/pkg/net/#TCPListener.Accept)\n[also interacts](https://github.com/golang/go/blob/2ea7d3461bb41d0ae12b56ee52d43314bcdb97f9/src/net/sock_cloexec.go#L20) with `syscall.ForkLock`,\nalthough only on older versions of Linux. 
\n\nBy some incredibly good luck, [golang/go#5838](https://github.com/golang/go/issues/5838), the switch from `fork` to `posix_spawn`, had,\nafter several years' delay, recently landed in Go 1.9, just in time for us. Gitaly had been compiled with Go 1.8.\n We quickly built and tested a new binary with Go 1.9 and manually deployed it\non one of our production servers.\n\n### Spectacular results\n\nHere's the CPU usage of Gitaly processes across the fleet:\n\n![CPU after Go 1.9](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-05.png){: .shadow.center}\nCPU after recompiling with Go 1.9\n{: .note .text-center}\n\nHere are the 99th percentile latency figures. This chart uses a logarithmic scale, so we're talking about two orders of\nmagnitude faster!\n\n![30x latency drops with Go 1.9](https://about.gitlab.com/images/blogimages/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/graph-06.png){: .shadow.center}\nEndpoint latency after recompiling with Go 1.9 (log scale)\n{: .note .text-center}\n\n## Conclusion\n\nRecompiling with Go 1.9 solved the problem, thanks to the switch to `posix_spawn`. We learned several other lessons\nin the process too:\n\n1. Having solid application monitoring in place allowed us to detect this issue, and start investigating it, far\n   earlier than we otherwise would have been able to.\n1. [pprof](https://blog.golang.org/profiling-go-programs) can be really helpful, but may not help when a process\n   has locked up and won't accept new connections. pprof is lightweight enough that you should consider adding it to your application _before_ you need it.\n1. When all else fails, [`SIGABRT thread dumps`](http://pro-tips-dot-com.tumblr.com/post/47677612115/kill-a-hung-go-process-and-print-stack-traces) might help.\n1. [`cAdvisor`](https://github.com/google/cadvisor) is great for monitoring cgroups. Systemd services each run in\n   their own cgroup, so `cAdvisor` is an easy way of monitoring a service and all its child processes together.\n\n[Photo](https://unsplash.com/photos/jJbQBP_yh68?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) by Javier García on [Unsplash](https://unsplash.com/search/photos/slow?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)\n{: .note}\n",[754,9],{"slug":1086,"featured":6,"template":688},"how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x","content:en-us:blog:how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x.yml","How A Fix In Go 19 Sped Up Our Gitaly Service By 30x","en-us/blog/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x.yml","en-us/blog/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x",{"_path":1092,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1093,"content":1099,"config":1106,"_id":1108,"_type":13,"title":1109,"_source":15,"_file":1110,"_stem":1111,"_extension":18},"/en-us/blog/how-is-ai-ml-changing-devops",{"title":1094,"description":1095,"ogTitle":1094,"ogDescription":1095,"noIndex":6,"ogImage":1096,"ogUrl":1097,"ogSiteName":672,"ogType":673,"canonicalUrls":1097,"schema":1098},"How is AI/ML changing DevOps?","Can DevOps help AI/ML find maturity? 
Here are questions to consider.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667540/Blog/Hero%20Images/devops-team-structure.jpg","https://about.gitlab.com/blog/how-is-ai-ml-changing-devops","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How is AI/ML changing DevOps?\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Brendan O'Leary\"}],\n        \"datePublished\": \"2022-11-16\",\n      }",{"title":1094,"description":1095,"authors":1100,"heroImage":1096,"date":1102,"body":1103,"category":1104,"tags":1105},[1101],"Brendan O'Leary","2022-11-16","\n\nThe last few years have seen an explosion in artificial intelligence, [machine learning](/blog/top-10-ways-machine-learning-may-help-devops/), and other types of projects. Companies like Hugging Face and applications like [DALL-E 2](https://openai.com/dall-e-2/) have brought to the mainstream what the power of AI/ML can bring to the next generation of computing and software. As every company has become a software company over the last few decades, the ability to innovate and to leverage the ever-growing amount of data that organizations have access to has become how enterprises compete.\n\nHowever, a lot of AI/ML projects stall because of several challenges that may seem familiar to software professionals who have been around since [the early days of DevOps](/blog/the-journey-to-a-devops-platform/). Adoption and optimization of artificial intelligence and machine learning have been hampered by a lack of repeatability for experiments, a disparity of tools and information silos, and a lack of team collaboration.\n\n## A new model for data modeling\n\nOne of the first ways to approach this problem is to make sure a mental model is in place that allows the team to reason about the strategic vision for AI/ML at your organization and, once that has been established, about the tactical “jobs to be done” that lay the foundation for that work.\n\nStrategically, there are many teams that have to come together for a successful AI/ML program. First, the data has to be both acquired and transformed into a usable set of clean data. Often referred to as [“DataOps,”](/blog/introducing-modelops-to-solve-data-science-challenges/) this involves the typical “ETL” (extract, transform, load) processes data has to go through to be useful for teams. From there, you have to productionize the data workloads through MLOps: the experimentation, training, testing, and deployment of meaningful models based on the extracted and transformed data.\n\nAnd once those two steps are complete, you can finally build production use cases for your data. You can use AI-assisted features to improve user experiences, to forecast financials, or to analyze general trends across various parts of your business. Given the complexity of this value chain, the various teams and skills involved, and the current mishmash of tooling, there is a lot that teams can learn from the history of DevOps as they tackle these problems.\n\n## DevOps and AI/ML\n\nMuch like the various stages of obtaining and applying AI/ML for business uses, software development consists of many varied steps with different teams and skill sets to achieve the business goals outlined. 
That is why, years ago, folks came up with the [concept of “DevOps”](/topics/devops/) – combining teams and having them work together in a cycle of continuous improvement towards the same goals – to combat silos and inefficiencies.\n\nData science teams are using specialized tools that don't integrate with the software development lifecycle tools they already use. This causes teams to work in silos, creating handoff friction and resulting in finger-pointing and lack of predictability. Businesses and software teams often fail to take advantage of data, and it takes months for models to get into production, by which time they may be out of date or behind competitors. Security and data ethics are frequently treated as an afterthought. This creates risk for organizations and slows innovation.\n\n## Learning from the past\n\nIf the past decades of “DevOps” evolution have taught us anything, it's that breaking down the silos between teams through the tools and processes they are using pays dividends for the business. As your team begins their [AI/ML journey](/blog/why-ai-in-devops-is-here-to-stay/) — or if you've found yourself stalling in AI/ML initiatives already — you should consider how you can bring teams together, ensure they are working efficiently, and enable them to collaborate without boundaries.\n\nAn explosion of tools in the space is tantalizing with the promise of “getting started” quickly. But it may not set your organization up for long-term success in these areas if those tools have the effect of separating parts of your organization from one another. Creating and sustaining an AI/ML program will require intentionality behind both the processes and tools your team is using. That allows your teams to extract, transform, and load data efficiently; tune, test, and deploy models effectively; and leverage AI/ML to drive value for your stakeholders for the long haul.\n","insights",[707,230,9,759],{"slug":1107,"featured":6,"template":688},"how-is-ai-ml-changing-devops","content:en-us:blog:how-is-ai-ml-changing-devops.yml","How Is Ai Ml Changing Devops","en-us/blog/how-is-ai-ml-changing-devops.yml","en-us/blog/how-is-ai-ml-changing-devops",{"_path":1113,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1114,"content":1120,"config":1126,"_id":1128,"_type":13,"title":1129,"_source":15,"_file":1130,"_stem":1131,"_extension":18},"/en-us/blog/how-to-automate-creation-of-runners",{"title":1115,"description":1116,"ogTitle":1115,"ogDescription":1116,"noIndex":6,"ogImage":1117,"ogUrl":1118,"ogSiteName":672,"ogType":673,"canonicalUrls":1118,"schema":1119},"How to automate the creation of GitLab Runners","Follow this step-by-step guide for automating runner setup using new runner creation workflows.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749664087/Blog/Hero%20Images/tanukicover.jpg","https://about.gitlab.com/blog/how-to-automate-creation-of-runners","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How to automate the creation of GitLab Runners\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Darren Eastman\"}],\n        \"datePublished\": \"2023-07-06\",\n      }",{"title":1115,"description":1116,"authors":1121,"heroImage":1117,"date":1123,"body":1124,"category":681,"tags":1125},[1122],"Darren Eastman","2023-07-06","\n\nAutomating the creation of GitLab Runners is an essential tactic in optimizing the operations and management of a runner fleet. 
Since announcing the [deprecation and planned removal of the legacy runner registration token](https://docs.gitlab.com/ee/architecture/blueprints/runner_tokens/#next-gitlab-runner-token-architecture) last year, there have been various questions from customers and the user community about the impact of the new workflow on any automation they rely on for creating and registering runners. This is a step-by-step guide for automating runner setup using the new runner creation workflows as depicted in the sequence diagram below.\n\n![GitLab Runner create - sequence diagram](https://about.gitlab.com/images/blogimages/2023-06-19-how-to-automate-creating-runners/runner_create_sequence_diagram.png){: .shadow}\n\n## New terminology and concepts\nBefore we dive into the automation steps, let’s first review a few new concepts in the runner creation process and how it differs from the registration token-based method. With the `registration token` method, a `registration token` is available for the instance, for each group, and for each project. Therefore, in a large GitLab installation, with many groups, sub-groups, and projects, you can have tens or hundreds of registration tokens that any authorized user can use to connect a runner. There are two steps to authorizing a runner (the application that you install on a target computing platform) to a GitLab instance:\n1. Retrieve a registration token.\n2. Run the register command in the runner application using the previously retrieved registration token.\n\nThe workflow images below depict the runner setup steps using the registration token compared with the new runner creation process.\n\n![GitLab Runner registration workflows](https://about.gitlab.com/images/blogimages/2023-06-19-how-to-automate-creating-runners/runner_registration_workflows.png){: .shadow}\n\n### Reusable runner configurations\nNow, in the registration token method, if you authenticated multiple runners using the same registration token (a valid use case), each runner entity would be visible in the UI in a separate row in the list view. The new creation method introduces the concept of a reusable runner configuration. For example, if you have to deploy multiple runners at the instance level, each with the same configuration (executor type, tags, etc.), you simply create a runner and configuration **once**, then register each individual runner with the same authentication token that you retrieved from the first runner creation. Each of these runners is now displayed in the UI in a nested hierarchy.\n\n![Runner detailed view with shared configurations](https://about.gitlab.com/images/blogimages/2023-06-19-how-to-automate-creating-runners/runner_detail_shared_configs.png){: .shadow}\n\nWe heard from many of you that your Runners view was cluttered because each runner created received its own row in the table, even if it had the exact same configuration as 100 others. With this change, our intent is to give you the flexibility you need to configure a runner fleet at scale while ensuring that you can still easily understand and manage the fleet in the GitLab Runners view. We understand that this is a paradigm shift that may take some getting used to.\n\n## Automation steps for creating a runner\nHere are the automation steps to create a runner.\n\n### Step 1: Create an access token\nYou will first need to create an access token. A [personal access token](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html) for an administrator account will allow you to create runners at the instance, group, and project levels.\n\nIf you only need to create a group or project runner, then it is best to use a group access token or project access token, respectively. For a group or project, navigate to `Settings / Access Tokens` and create a token. You must specify a name, the token expiration date, role, and scope. For the role, select `Owner`; for the scopes, select `create_runner`.
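\n\nIf you prefer to script this step as well, group and project access tokens can themselves be created through the REST API (a sketch; the project ID, token name, and expiry date are illustrative, and the acting user needs the Owner role):\n\n```shell\n# Create a project access token limited to the create_runner scope\n# (access_level 50 corresponds to the Owner role)\ncurl -sX POST \"https://gitlab.example.com/api/v4/projects/42/access_tokens\" \\\n     --header \"PRIVATE-TOKEN: $ADMIN_TOKEN\" \\\n     --data \"name=runner-creation-token\" \\\n     --data \"scopes[]=create_runner\" \\\n     --data \"access_level=50\" \\\n     --data \"expires_at=2024-01-31\"\n```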
\n\nNote: The access token is only visible once in the UI. You will need to store this token in a secure location, for example, a secrets management solution such as [Hashicorp Vault](https://docs.gitlab.com/ee/ci/examples/authenticating-with-hashicorp-vault/) or the [Keeper Secrets Manager Terraform plugin](https://docs.keeper.io/secrets-manager/secrets-manager/integrations/terraform).\n\n![Project access token creation](https://about.gitlab.com/images/blogimages/2023-06-19-how-to-automate-creating-runners/project_access_token.png){: .shadow}\n\n### Step 2: Use the access token to create a runner in the GitLab instance\nNow that you have an access token scoped to the instance, group, or project, the next step is to use that token to create a runner automatically. In this example, we will simply invoke a POST REST endpoint in a terminal using curl.\n\n```\ncurl -sX POST https://gitlab.example.com/api/v4/user/runners --data runner_type=group_type --data \"group_id=\u003Ctarget_group_or_project_id>\" --data \"description=software-eng-docker-builds-runner\" --data \"tag_list=\u003Cyour comma-separated tags>\" --header \"PRIVATE-TOKEN: \u003Cyour_access_token>\"\n```\n\nOnce this step is complete, the newly created runner configuration is visible in the GitLab UI. As the actual runner has not yet been configured, the status displayed is `Never contacted`.\n\nThe API will return a message with the following fields: `id`, `token`, and `token_expires_at`. You must save the value of the `token` as it will only be displayed once.\n\nAs mentioned above, a critical point to note in the new runner creation workflow is that you can reuse the runner token value to register multiple runners. If you choose to do that, runners created with the same token will be grouped in the Runners list. Whichever runner contacted GitLab most recently will be the one whose unique data (IP address, version, last contact time, and status) displays in the list. You can still view all the runners in that group _and_ compare all of their unique data by going to the details page for that runner. Each runner in the group is uniquely identified by its `system_id`.\n\nAt this point, you might ask yourself, what’s the difference between this new workflow and the workflow that relies on the registration token? The benefits are:\n1. You can now quickly identify the user that created a runner configuration. Not only does this add a layer of security compared to the old method, but it also simplifies troubleshooting runner performance issues, especially when your fleet expands.\n1. Only the creator of the runner or administrator(s) can edit crucial configuration details like tags, the ability to run untagged jobs, the setting that locks the runner to only run jobs in the projects it is shared with, and more.
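\n\nTo make the reuse concrete, registering an additional runner process against the same stored configuration is a single command per host (a sketch; the URL and `glrt-` token value are illustrative):\n\n```shell\n# Register a runner process using the authentication token returned by the API;\n# repeat on as many hosts as should share this configuration\ngitlab-runner register --non-interactive \\\n  --url \"https://gitlab.example.com\" \\\n  --token \"glrt-EXAMPLETOKENVALUE\" \\\n  --executor \"docker\" \\\n  --docker-image \"alpine:latest\"\n```\n\nRunners registered this way appear nested under the shared configuration in the Runners view.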
\n\n## Automation of runner install and registration\nWith the runner configuration creation steps completed, you now have a runner or runners configured in your GitLab instance and valid runner tokens that you can use to register a runner. You can manually install the runner application to a target compute host or automate the runner application installation. If you plan to host the runner on a public cloud virtual machine instance – for example, [Google Cloud Compute Engine](https://cloud.google.com/compute/docs/instances) – then a good [example pattern](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/1932#note_1172713979) provided by one of our customers for automating the runner install and registration process is as follows:\n1. Use [Terraform infrastructure as code](https://docs.gitlab.com/ee/user/infrastructure/iac/) to install the runner application to a virtual machine hosted on GCP.\n1. Use the [GCP Terraform provider](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance) and specifically the `metadata` key to automatically add the runner authentication token to the runner configuration file on the newly created GCP virtual machine.\n1. Register the newly installed runner with the target GitLab instance using a [cloud-init](https://cloudinit.readthedocs.io/en/latest/index.html#) script populated from the GCP Terraform provider.\n\n**Example cloud-init script**\n\n```shell\n#!/bin/bash\napt update\n\n# Add the GitLab Runner package repository\ncurl -L \"https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh\" | bash\n\n# Read the instance name and runner settings from the GCP metadata server\nGL_NAME=$(curl 169.254.169.254/computeMetadata/v1/instance/name -H \"Metadata-Flavor: Google\")\nGL_EXECUTOR=$(curl 169.254.169.254/computeMetadata/v1/instance/attributes/gl_executor -H \"Metadata-Flavor: Google\")\n# The runner authentication token is injected through the Terraform `metadata` key;\n# the attribute name here is whatever your Terraform configuration sets\nGL_RUNNER_TOKEN=$(curl 169.254.169.254/computeMetadata/v1/instance/attributes/runner_token -H \"Metadata-Flavor: Google\")\n\napt install -y gitlab-runner\ngitlab-runner register --non-interactive --name=\"$GL_NAME\" --url=\"https://gitlab.com\" --token=\"$GL_RUNNER_TOKEN\" --request-concurrency=\"12\" --executor=\"$GL_EXECUTOR\" --docker-image=\"alpine:latest\"\nsystemctl restart gitlab-runner\n```\n\n## What's next?\nSo there you have it, an overview of how to automate runner creation, installation, and registration. To summarize in three simple steps:\n1. Use the API to create a runner token and configuration.\n1. Store the retrieved authentication token in a secrets management solution.\n1. Use infrastructure as code to install the runner application on a target compute host (sketched below).
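\n\nAs a compact stand-in for the Terraform flow above, the same metadata wiring can be sketched with the gcloud CLI (the instance name, zone, attribute names, and script path are all illustrative):\n\n```shell\n# Create a VM that passes the executor and runner authentication token as instance\n# metadata, and runs the install/register script from above at first boot\ngcloud compute instances create runner-vm-01 \\\n  --zone=us-central1-a \\\n  --metadata=gl_executor=docker,runner_token=glrt-EXAMPLETOKENVALUE \\\n  --metadata-from-file=startup-script=./install-runner.sh\n```\n\nIn a real deployment, the token value would come from your secrets management solution rather than being passed inline.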
\n\nOur long-term vision is to directly incorporate this automation lifecycle into the product to simplify your day-to-day runner fleet management operations.\n",[755,732,9],{"slug":1127,"featured":6,"template":688},"how-to-automate-creation-of-runners","content:en-us:blog:how-to-automate-creation-of-runners.yml","How To Automate Creation Of Runners","en-us/blog/how-to-automate-creation-of-runners.yml","en-us/blog/how-to-automate-creation-of-runners",{"_path":1133,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1134,"content":1140,"config":1146,"_id":1148,"_type":13,"title":1149,"_source":15,"_file":1150,"_stem":1151,"_extension":18},"/en-us/blog/how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days",{"title":1135,"description":1136,"ogTitle":1135,"ogDescription":1136,"noIndex":6,"ogImage":1137,"ogUrl":1138,"ogSiteName":672,"ogType":673,"canonicalUrls":1138,"schema":1139},"How we boosted WebAuthn adoption from 20 percent to 93 percent in two days","With phishing campaigns on the rise across the industry, we accelerated rollout of a program to further enhance our security hygiene. This is how we did it.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749682498/Blog/Hero%20Images/webauthn.jpg","https://about.gitlab.com/blog/how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we boosted WebAuthn adoption from 20 percent to 93 percent in two days\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Eric Rubin\"}],\n        \"datePublished\": \"2022-11-09\",\n      }",{"title":1135,"description":1136,"authors":1141,"heroImage":1137,"date":1143,"body":1144,"category":925,"tags":1145},[1142],"Eric Rubin","2022-11-09","\nIn light of the high-profile phishing campaigns that breached public technology companies (e.g. [Twilio](https://techcrunch.com/2022/08/08/twilio-breach-customer-data/), [Uber](https://www.wired.com/story/uber-hack-mfa-phishing/), [Dropbox](https://www.securityweek.com/hackers-stole-source-code-personal-data-dropbox-following-phishing-attack), and others), GitLab decided to accelerate the implementation of the next phase of our security hygiene program, which would further enhance our security posture. As part of this acceleration, GitLab’s IT and Security teams recommended a swift adoption of phishing-resistant authentication across the entire company.\n\n## What did we decide to implement?\n\nWe already required multi-factor authentication (MFA) for all team members to log in to Okta, our primary launching point for the SaaS applications we use. The majority of our team members were primarily using the Okta Verify mobile app for push notifications, although they also had the option of using time-based one-time password ([TOTP](https://www.techtarget.com/searchsecurity/definition/time-based-one-time-password-TOTP)) codes or [WebAuthn/FIDO2](https://webauthn.guide/) devices such as biometrics (for example, Touch ID and Face ID) or security keys.\n\nWe decided to mandate the use of WebAuthn devices as the sole method for logging into Okta and remove the other methods, and to get almost all team members enrolled within 48 hours of launch.\n\n## Why is using WebAuthn important?\n\nOther two-factor authentication methods have known limitations. 
We already prohibited the use of SMS as a method for MFA as it is vulnerable to [SIM swap attacks](https://9to5mac.com/2021/10/01/protections-against-sim-swap/#:~:text=A%20port%2Dout%20attack%20is,new%20account%2C%20which%20they%20control); additionally, SMS codes stay valid long enough for a phisher to use a texted code on the legitimate website. TOTP codes have a shorter duration, but could still allow for [relay attacks](https://intel471.com/blog/otp-password-bots-telegram). Push-based MFA such as the Okta Verify mobile app is vulnerable to [MFA fatigue attacks](https://www.uber.com/newsroom/security-update), where an attacker repeatedly bombards the user in the hope that they either get frustrated and approve a notification to make it stop or accidentally approve one.\n\nWe decided that we needed to go back to fundamentals – strong MFA that is phishing-resistant. WebAuthn uses public-key cryptography, which verifies that the website you are logging into is the correct one. Additionally, the website only allows specifically enrolled devices to complete the authentication. The WebAuthn device effectively takes the human out of the loop – you can’t send the credentials to a phishing site.\n\n## How did we communicate the change to mandatory WebAuthn?\n\nThe communication to team members about the transition to WebAuthn started with a company-wide Slack announcement from our CEO and co-founder [Sid Sijbrandij](https://gitlab.com/sytses). The message was delivered on a Tuesday evening Pacific Time, with an implementation completion date of Thursday evening Pacific Time.\n\nWe also:\n- Created a dedicated Slack channel for team member questions.\n- Circulated a Google Doc FAQ with more than 47 questions populated by team members and answered by the [DRI](/handbook/people-group/directly-responsible-individuals/) for the implementation or other team members. At GitLab, everyone is encouraged to contribute.\n- Highlighted the change in our internal newsletter.\n- Added documentation, including easy-to-follow instructions, to our [handbook](/handbook/business-technology/okta/).\n\n## How did we implement the change to WebAuthn?\n\nHow could we roll out WebAuthn so quickly, with more than 1,700 team members working remotely across more than 65 countries? We had already started the ball rolling earlier this year. First, we pre-tested with a small group in IT, and then with company-wide volunteers, providing instructions for team members to use. Uptake was low though, so we knew we had to be more assertive.\n\nGitLab is a majority Mac company, so we were able to take advantage of the built-in Touch ID capability already available on team members' laptops. It was also very helpful that users were familiar with the technology from using it on their smartphones.\n\nFor the ~5% of users who are on Linux, we instructed them to use their YubiKeys, and if they didn’t already have one, we facilitated delivery via Yubico’s [YubiEnterprise Delivery](https://www.yubico.com/products/yubienterprise-delivery/). We allowed any team member who wanted a YubiKey to get one via our deal, including Mac users who wanted to use Firefox ([Touch ID isn’t supported yet](https://bugzilla.mozilla.org/show_bug.cgi?id=1536482)), those who work with their laptop docked and didn’t want a new Touch ID external keyboard, or anyone with any other reason. 
In all, we had about 20% of our team members take up our offer to obtain YubiKeys.\n\nOur biggest win after the start of the rollout was the discovery of how to add new WebAuthn devices to Okta (such as a new laptop or smartphone) via QR code scanning. This meant that as long as team members had a single enrolled device (either their laptop or their phone), they could [self-service](/handbook/business-technology/okta/#i-want-to-add-touch-id--face-id-to-okta-for-my-mobile-device-iphone-android-tablet) the WebAuthn enrollment of a new device, without requiring IT Helpdesk support. This helped us speed up the rollout and strengthen our security posture more quickly, and it meant that we didn’t have to send all team members YubiKeys that would only be used in the relatively rare event of enrolling a new device.\n\n## Initial results \n\nAfter the Slack announcement was posted, our IT Helpdesk team held virtual “office hours” on Zoom staffed for at least two hours per region. During the virtual office hours, team members could drop in and get real-time help. Within 24 hours of the launch of the initiative, 80% of team members had already enrolled!\n\nTo push us further along, a Slack Bot was created and customized messages were sent directly to team members who had not yet enrolled, as well as to their managers. This additional step brought enrollment to 93% of our team members.\n\nAt our deadline, we implemented carefully crafted new policies in Okta, locking down the vast majority of team members to using only WebAuthn. Small exception groups were created for those on PTO (because it would be frustrating for them and create unnecessary troubleshooting requests for the IT Helpdesk), as well as for some users awaiting the arrival of their shipped YubiKeys.\n\nThe new Okta policy and communication efforts were quite successful for us, and we have been pleased by the low volume of support requests, given the magnitude of the change and the timeframe.\n\n## Going forward \n\nWe know that [threat vectors are always evolving](/blog/top-challenges-to-securing-the-software-supply-chain/) and we will continue to monitor them closely. 
We will also continue to assess our security posture and iterate to make improvements as needed.\n\nCover image by [FLY:D](https://unsplash.com/@flyd2069) on Unsplash.\n{: .note}\n",[925,9,754],{"slug":1147,"featured":6,"template":688},"how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days","content:en-us:blog:how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days.yml","How We Boosted Webauthn Adoption From 20 Percent To 93 Percent In 2 Days","en-us/blog/how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days.yml","en-us/blog/how-we-boosted-webauthn-adoption-from-20-percent-to-93-percent-in-2-days",{"_path":1153,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1154,"content":1160,"config":1167,"_id":1169,"_type":13,"title":1170,"_source":15,"_file":1171,"_stem":1172,"_extension":18},"/en-us/blog/how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes",{"title":1155,"description":1156,"ogTitle":1155,"ogDescription":1156,"noIndex":6,"ogImage":1157,"ogUrl":1158,"ogSiteName":672,"ogType":673,"canonicalUrls":1158,"schema":1159},"How we decreased GitLab repo backup times from 48 hours to 41 minutes","Learn how we tracked a performance bottleneck to a 15-year-old Git function and fixed it, leading to enhanced efficiency that supports more robust backup strategies and can reduce risk.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750097166/Blog/Hero%20Images/Blog/Hero%20Images/REFERENCE%20-%20display%20preview%20for%20blog%20images%20%282%29_2pKf8RsKzAaThmQfqHIaa7_1750097166565.png","https://about.gitlab.com/blog/how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we decreased GitLab repo backup times from 48 hours to 41 minutes\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Karthik Nayak\"},{\"@type\":\"Person\",\"name\":\"Manuel Kraft\"}],\n        \"datePublished\": \"2025-06-05\",\n      }",{"title":1155,"description":1156,"authors":1161,"heroImage":1157,"date":1164,"body":1165,"category":681,"tags":1166},[1162,1163],"Karthik Nayak","Manuel Kraft","2025-06-05","Repository backups are a critical component of any robust disaster recovery strategy. However, as repositories grow in size, the process of creating reliable backups becomes increasingly challenging. Our own [Rails repository](https://gitlab.com/gitlab-org/gitlab) was taking 48 hours to back up — forcing impossible choices between backup frequency and system performance. We wanted to tackle this issue for our customers and for our own users internally. \n\nUltimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, __dramatically reducing backup times__. The result: lower costs, reduced risk, and backup strategies that actually scale with your codebase.\n\nThis turned out to be a Git scalability issue that affects anyone with large repositories. Here's how we tracked it down and fixed it. \n\n## Backup at scale\n\nFirst, let's look at the problem. As organizations scale their repositories and backups grow more complex, here are some of the challenges they can face:\n\n* **Time-prohibitive backups:** For very large repositories, creating a repository backup could take several hours, which can hinder the ability to schedule regular backups. 
\n* **Resource intensity:** Extended backup processes can consume substantial server resources, potentially impacting other operations.\n* **Backup windows:** Finding adequate maintenance windows for such lengthy processes can be difficult for teams running 24/7 operations.\n* **Increased failure risk:** Long-running processes are more susceptible to interruptions from network issues, server restarts, and system errors, which can force teams to restart the entire, lengthy backup process from scratch.\n* **Race conditions:** Because it takes a long time to create a backup, the repository might have changed a lot during the process, potentially creating an invalid backup or interrupting the backup because objects are no longer available.\n\nThese challenges can lead to compromising on backup frequency or completeness – an unacceptable trade-off when it comes to data protection. Extended backup windows can force customers into workarounds. Some might adopt external tooling, while others might reduce backup frequency, resulting in potentially inconsistent data protection strategies across organizations.\n\nNow, let's dig into how we identified a performance bottleneck, found a resolution, and deployed it to help cut backup times.\n\n## The technical challenge\n\nGitLab's repository backup functionality relies on the [`git bundle create`](https://git-scm.com/docs/git-bundle) command, which captures a complete snapshot of a repository, including all objects and references like branches and tags. This bundle serves as a restoration point for recreating the repository in its exact state.\n\nHowever, the implementation of the command suffered from poor scalability related to reference count, creating a performance bottleneck. As repositories accumulated more references, processing time increased quadratically. In our largest repositories containing millions of references, backup operations could extend beyond 48 hours.\n\n### Root cause analysis\n\nTo identify the root cause of this performance bottleneck, we analyzed a flame graph of the command during execution.\n\n![Flame graph showing command during execution](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750097176/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750097176388.jpg)\n\nA flame graph displays the execution path of a command through its stack trace. Each bar corresponds to a function in the code, with the bar's width indicating how much time the command spent executing within that particular function.\n\nWhen examining the flame graph of `git bundle create` running on a repository with 10,000 references, approximately 80% of the execution time is consumed by the `object_array_remove_duplicates()` function. This function was introduced to Git in the [commit b2a6d1c686](https://gitlab.com/gitlab-org/git/-/commit/b2a6d1c686) (bundle: allow the same ref to be given more than once, 2009-01-17).\n\nTo understand this change, it's important to know that `git bundle create` allows users to specify which references to include in the bundle. For complete repository bundles, the `--all` flag packages all references.\n\nThe commit addressed a problem where users providing duplicate references through the command line – such as `git bundle create main.bundle main main` – would create a bundle without properly handling the duplicated `main` reference. Unbundling this bundle in a Git repository would break, because it tries to write the same ref twice.
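\n\nThe failure mode that commit addressed is easy to picture (a sketch; the bundle and branch names are illustrative):\n\n```shell\n# Passing the same ref twice used to put two copies of it into the bundle\ngit bundle create main.bundle main main\n\n# Unbundling (for example, by cloning from the bundle) would then break,\n# because Git tried to write the `main` ref twice\ngit clone main.bundle restored-repo\n```\n\nWith the de-duplication in place, the same invocation packages `main` only once.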
\n\nThe code to avoid duplication uses nested `for` loops that iterate through all references to identify duplicates. This O(N²) algorithm becomes a significant performance bottleneck in repositories with large reference counts, consuming substantial processing time.\n\n### The fix: From O(N²) to efficient mapping\n\nTo resolve this performance issue, we contributed an upstream fix to Git that replaces the nested loops with a map data structure. Each reference is added to the map, which automatically ensures only a single copy of each reference is retained for processing.\n\nThis change dramatically enhances the performance of `git bundle create` and enables much better scalability in repositories with large reference counts. Benchmark testing on a repository with 100,000 references demonstrates a 6x performance improvement.\n\n```shell\nBenchmark 1: bundle (refcount = 100000, revision = master)\n  Time (mean ± σ): \t14.653 s ±  0.203 s\t[User: 13.940 s, System: 0.762 s]\n  Range (min … max):   14.237 s … 14.920 s\t10 runs\n\nBenchmark 2: bundle (refcount = 100000, revision = HEAD)\n  Time (mean ± σ):  \t2.394 s ±  0.023 s\t[User: 1.684 s, System: 0.798 s]\n  Range (min … max):\t2.364 s …  2.425 s\t10 runs\n\nSummary\n  bundle (refcount = 100000, revision = HEAD) ran\n\t6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)\n```\n\nThe patch was accepted and [merged](https://gitlab.com/gitlab-org/git/-/commit/bb74c0abbc31da35be52999569ea481ebd149d1d) into upstream Git. At GitLab, we backported this fix to ensure our customers could benefit immediately, without waiting for the next Git release.\n\n## The result: Dramatically decreased backup times\n\nThe performance gains from this improvement have been nothing short of transformative:\n\n* **From 48 hours to 41 minutes:** Creating a backup of our largest repository (`gitlab-org/gitlab`) now takes just 1.4% of the original time.\n* **Consistent performance:** The improvement scales reliably across repository sizes.\n* **Resource efficiency:** We significantly reduced server load during backup operations.\n* **Broader applicability:** While backup creation sees the most dramatic improvement, all bundle-based operations that operate on many references benefit.\n\n## What this means for GitLab customers\n\nFor GitLab customers, this enhancement delivers immediate and tangible benefits for how organizations approach repository backup and disaster recovery planning:\n* **Transformed backup strategies**   \n  * Enterprise teams can establish comprehensive nightly schedules without impacting development workflows or requiring extensive backup windows.   \n  * Backups can now run seamlessly in the background during nightly schedules, instead of needing to be dedicated and lengthy.  \n* **Enhanced business continuity**  \n  * With backup times reduced from days to minutes, organizations significantly minimize their recovery point objectives (RPO). This translates to reduced business risk – in a disaster scenario, you're potentially recovering hours of work instead of days.  \n* **Reduced operational overhead**   \n  * Less server resource consumption and shorter maintenance windows.  \n  * Shorter backup windows mean reduced compute costs, especially in cloud environments, where extended processing time translates directly to higher bills.  \n* **Future-proofed infrastructure**   \n  * Growing repositories no longer force difficult choices between backup frequency and system performance.  \n  * As your codebase expands, your backup strategy can scale seamlessly alongside it.
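\n\nIf you want to measure the effect on your own instance, repository backups run through the standard backup command, so timing a run before and after upgrading is straightforward (a sketch for a Linux package installation; the SKIP list simply narrows the run to repositories and is illustrative):\n\n```shell\n# Time a backup run that only processes Git repositories\ntime sudo gitlab-backup create SKIP=db,uploads,builds,artifacts,lfs,registry,pages\n```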
\n\nOrganizations can now implement more robust backup strategies without compromising on performance or completeness. What was once a challenging trade-off has become a straightforward operational practice.\n\nStarting with the [GitLab 18.0](https://about.gitlab.com/releases/2025/05/15/gitlab-18-0-released/) release, all GitLab customers, regardless of license tier, can take full advantage of these improvements for their [backup](https://docs.gitlab.com/administration/backup_restore/backup_gitlab/) strategy and execution. No configuration changes are required.\n\n## What's next\n\nThis breakthrough is part of our ongoing commitment to scalable, enterprise-grade Git infrastructure. While the improvement in backup creation time from 48 hours to 41 minutes represents a significant milestone, we continue to identify and address performance bottlenecks throughout our stack.\n\nWe're particularly proud that this enhancement was contributed upstream to the Git project, benefiting not just GitLab users but the broader Git community. This collaborative approach to development ensures that improvements are thoroughly reviewed, widely tested, and available to all.\n\n> Deep infrastructure work like this is how we approach performance at GitLab. Join the GitLab 18 virtual launch event to see what other fundamental improvements we're shipping. [Register today!](https://about.gitlab.com/eighteen/)",[757,708,781,9,481],{"slug":1168,"featured":90,"template":688},"how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes","content:en-us:blog:how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes.yml","How We Decreased Gitlab Repo Backup Times From 48 Hours To 41 Minutes","en-us/blog/how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes.yml","en-us/blog/how-we-decreased-gitlab-repo-backup-times-from-48-hours-to-41-minutes",{"_path":1174,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1175,"content":1181,"config":1188,"_id":1190,"_type":13,"title":1191,"_source":15,"_file":1192,"_stem":1193,"_extension":18},"/en-us/blog/how-we-designed-the-gitlab-reference-architectures",{"title":1176,"description":1177,"ogTitle":1176,"ogDescription":1177,"noIndex":6,"ogImage":1178,"ogUrl":1179,"ogSiteName":672,"ogType":673,"canonicalUrls":1179,"schema":1180},"How we designed the GitLab Reference Architectures","Take a look back with us as we dive into our Reference Architectures design journey to help users easily deploy GitLab at scale. 
Learn our goals, process, and what's happened in the five years since.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098651/Blog/Hero%20Images/Blog/Hero%20Images/blog-image-template-1800x945%20%282%29_52vS9ne2Hu3TElOeHep0AF_1750098651525.png","https://about.gitlab.com/blog/how-we-designed-the-gitlab-reference-architectures","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we designed the GitLab Reference Architectures\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Grant Young\"}],\n        \"datePublished\": \"2024-10-02\",\n      }",{"title":1176,"description":1177,"authors":1182,"heroImage":1178,"date":1184,"body":1185,"category":681,"tags":1186},[1183],"Grant Young","2024-10-02","We introduced the first [GitLab Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures) five years ago. Originally developed as a partnership between the GitLab Test Platform (formerly Quality Engineering) and Support teams, along with other contributors, these architectures aim to provide scalable and elastic starting points to deploy GitLab at scale, tailored to an organization's target load.\n\nSince their debut, we've been thrilled to see the impact these architectures have had on our customers as they navigate their DevSecOps journey. We continue to iterate, expand, and refine the architectures, reflecting our commitment to providing you with the latest, best-in-class guidance on deploying, scaling, and maintaining your GitLab environments.\n\nIn recognition of the five-year milestone, here is a peek behind the curtain on _how_ we designed the Reference Architectures and how that design still applies today.\n\n## The problem\n\nBefore introducing the Reference Architectures, we frequently heard from our customers about the hurdles they faced when deploying GitLab at scale to meet their performance and availability goals.\n\nWhile every GitLab environment is a little unique because it must meet a customer's own requirements, we recognized from running GitLab.com, as well as from our larger customers, that there were common fundamentals to deploying GitLab at scale that were worth sharing. Our objective was to address customer needs while promoting deployment best practices to reduce drift and increase alignment.\n\nSimultaneously, we wanted to significantly expand our performance testing efforts. The goals of this expansion were to provide our engineering teams with a deeper understanding of performance bottlenecks, to drive improvements in GitLab's performance, and to continuously test the application moving forward to ensure it remained performant. 
However, to conduct meaningful performance tests, we needed a standardized GitLab environment design capable of handling the target loads.\n\nEnter the Reference Architectures.\n\n## The goals\n\nWith the need for a common architecture clear, we next set the goals of this initiative, which ultimately became the following:\n\n- Performance: Ensure the architecture can handle the target load efficiently.\n- Availability: Maximize uptime and reliability wherever possible.\n- Scalability and elasticity: Ensure the architecture is scalable and elastic to meet individual customer needs.\n- Cost-effectiveness: Optimize resource allocation to avoid unnecessary expenses.\n- Maintainability: Make the architecture deployment and management as straightforward as possible with standardized configurations.\n\nIt's worth noting that these goals are not listed in any particular order, and they are goals we stay true to today.\n\n## The process\n\nOnce the goals were set, we faced the challenge of designing an architecture, validating it, and making sure that it was fit for purpose and met those goals.\n\nThe process itself was relatively simple in design:\n\n- Gather metrics on existing environments and the loads they were able to handle.\n- Define a prototype architecture based on these metrics.\n- Build and test the environment to validate.\n- Adjust the environment iteratively based on the test results and metrics until we had a validated architecture that met the goals.\n\nWhile simple on paper, practice was, of course, another matter, so we got to work.\n\nFirst, we collected and reviewed the data. To that end, we reviewed metrics and logging data from GitLab.com as well as several participating large customers to correlate the environment sizes deployed to the load they were handling. To achieve this, we needed an objective and quantifiable way to measure that load across any environment, and for that we used **Requests per Second (RPS)**. With RPS we could see the concurrent load each environment handled and correlate this to the user count accordingly. Specifically, a user count would correlate to the full manual and automated load (such as continuous integration). From that data, we were able to compare several environment sizes and start to pick out common patterns for the architectures.\n\nNext, we started with a prototype architecture that aimed to meet the goals while cross-referencing with the data we collected. In fact, we started this step in conjunction with the first, as we had a good enough idea of where to start: taking the fundamental GitLab.com design and scaling it down for individual customer loads in cost-effective ways. This allowed us to start performance testing the prototype with the data we were analyzing to corroborate accordingly. After quite a few iterations, we had a starting point for our prototype architecture.\n\nTo thoroughly test and validate the architecture, we needed to turn to performance testing and define our methodology. The approach was to target our most common endpoints with a representative test data set at RPS loads that were also representative. Then, although we had manually built the prototype architecture, we knew we needed tooling to automatically build environments and handle tasks such as updates. 
These efforts resulted in the [GitLab Performance Tool](https://about.gitlab.com/blog/how-were-building-up-performance-testing-of-gitlab/) and [GitLab Environment Toolkit](https://about.gitlab.com/blog/why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale/), which I blogged about previously and which we continue to use to this day (and you can use them too!).\n\nWith all the above in place, we started the main work of validating the prototype architecture through multiple cycles of testing and iterating. In each cycle, we would performance test the environment, review the results and metrics, and adjust the environment accordingly. Through iteration, we were able to identify which failures were real application performance issues and which were environmental, and eventually we had our first architecture. That architecture is now known as the [200 RPS or 10,000-user Reference Architecture](https://docs.gitlab.com/ee/administration/reference_architectures/10k_users.html).\n\n![GitLab Reference Architecture - 200 RPS](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098658/Blog/Content%20Images/Blog/Content%20Images/reference_architecture_aHR0cHM6_1750098658326.png)\n\n## Where Reference Architectures are today\n\nSince publishing our first validated Reference Architecture, the work has never stopped! We like to describe the architectures as living documentation, as they're constantly being improved and expanded with additions such as:\n\n- various Reference Architecture sizes based on common deployments\n- non-highly available sizes for smaller environments\n- full step-by-step documentation in collaboration with our colleagues in Technical Writing and Support\n- expanded guidance and a new naming scheme to help with right sizing, scaling, and how to deal with outliers such as monorepos\n- cloud native hybrid variants where select components are run in Kubernetes\n- recommendations and guidance for cloud provider services\n- and more! Check out the [update history](https://docs.gitlab.com/ee/administration/reference_architectures/#update-history) section in the Reference Architecture documentation!\n\nAll this is driven by our [comprehensive testing program](https://docs.gitlab.com/ee/administration/reference_architectures/#validation-and-test-results) that we built alongside the Reference Architectures to continuously test that they remain fit for purpose against the latest GitLab code _every single week_ and to catch any unexpected performance issues early.\n\nAnd we're thrilled that these efforts have helped numerous customers to date, as well as our own engineering teams, to deliver new, exciting services. In fact, our engineering teams used the Reference Architectures to develop [GitLab Dedicated](https://about.gitlab.com/dedicated/). Five years on, our commitment is stronger than ever. 
The work very much continues in the same way it started to ensure you have the best-in-class guidance for your DevSecOps journey.\n\n> Learn more about [GitLab Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures/).\n",[683,758,9,754,1187],"customers",{"slug":1189,"featured":90,"template":688},"how-we-designed-the-gitlab-reference-architectures","content:en-us:blog:how-we-designed-the-gitlab-reference-architectures.yml","How We Designed The Gitlab Reference Architectures","en-us/blog/how-we-designed-the-gitlab-reference-architectures.yml","en-us/blog/how-we-designed-the-gitlab-reference-architectures",{"_path":1195,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1196,"content":1202,"config":1208,"_id":1210,"_type":13,"title":1211,"_source":15,"_file":1212,"_stem":1213,"_extension":18},"/en-us/blog/how-we-diagnosed-and-resolved-redis-latency-spikes",{"title":1197,"description":1198,"ogTitle":1197,"ogDescription":1198,"noIndex":6,"ogImage":1199,"ogUrl":1200,"ogSiteName":672,"ogType":673,"canonicalUrls":1200,"schema":1201},"How we diagnosed and resolved Redis latency spikes with BPF and other tools","How we uncovered a three-phase cycle involving two distinct saturation points and a simple fix to break that cycle.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667913/Blog/Hero%20Images/clocks.jpg","https://about.gitlab.com/blog/how-we-diagnosed-and-resolved-redis-latency-spikes","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we diagnosed and resolved Redis latency spikes with BPF and other tools\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Matt Smiley\"}],\n        \"datePublished\": \"2022-11-28\",\n      }",{"title":1197,"description":1198,"authors":1203,"heroImage":1199,"date":1205,"body":1206,"category":681,"tags":1207},[1204],"Matt Smiley","2022-11-28","\n\nIf you enjoy performance engineering and peeling back abstraction layers to ask underlying subsystems to explain themselves, this article’s for you. The context is a chronic Redis latency problem, and you are about to tour a practical example of using BPF and profiling tools in concert with standard metrics to reveal unintuitive behaviors of a complex system.\n\nBeyond the tools and techniques, we also use an iterative hypothesis-testing approach to compose a behavior model of the system dynamics. This model tells us what factors influence the problem's severity and triggering conditions.\n\nUltimately, we find the root cause, and its remedy is delightfully boring and effective. We uncover a three-phase cycle involving two distinct saturation points and a simple fix to break that cycle. Along the way, we inspect aspects of the system’s behavior using stack sampling profiles, heat maps and flamegraphs, experimental tuning, source and binary analysis, instruction-level BPF instrumentation, and targeted latency injection under specific entry and exit conditions.\n\nIf you are short on time, the takeaways are summarized at the end. But the journey is the fun part, so let's dig in!\n\n## Introducing the problem: Chronic latency \n\nGitLab makes extensive use of Redis, and, on GitLab.com SaaS, we use [separate Redis clusters](/handbook/engineering/infrastructure/production/architecture/#redis-architecture) for certain functions. 
This tale concerns a Redis instance acting exclusively as a least recently used (LRU) cache.\n\nThis cache had a chronic latency problem that started occurring intermittently over two years ago and in recent months had become significantly worse: Every few minutes, it suffered from bursts of very high latency and a corresponding throughput drop, eating into its Service Level Objective (SLO). These latency spikes impacted user-facing response times and [burned error budgets](https://gitlab.com/gitlab-org/gitlab/-/issues/360578#note_966597336) for dependent features, and this is what we aimed to solve.\n\n**Graph:** Spikes in the rate of extremely slow (1 second) Redis requests, each corresponding to an eviction burst\n\n![Graph showing spikes in the slow request rate every few minutes](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/00_redis_slow_request_rate_spikes_during_each_eviction_burst.png)\n\nIn prior work, we had already completed several mitigating optimizations. These sufficed for a while, but organic growth had resurfaced this as an important [long-term scaling problem](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#why-is-it-important-to-get-to-the-root-of-the-latency-spikes). We had also already ruled out externally triggered causes, such as request floods, connection rate spikes, host-level resource contention, etc. These latency spikes were consistently associated with memory usage reaching the eviction threshold (`maxmemory`), not with changes in client traffic patterns or other processes competing with Redis for CPU time, memory bandwidth, or network I/O.\n\nWe [initially thought](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1567) that Redis 6.2’s new [eviction throttling mechanism](https://github.com/redis/redis/pull/7653) might alleviate our eviction burst overhead. It did not. That mechanism solves a different problem: It prevents a stall condition where a single call to `performEvictions` could run arbitrarily long. In contrast, during this analysis we [discovered](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_977816216) that our problem (both before and after upgrading Redis) was related to numerous calls collectively reducing Redis throughput, rather than a few extremely slow calls causing a complete stall.\n\nTo discover our bottleneck and its potential solutions, we needed to investigate Redis’s behavior during our workload’s eviction bursts.\n\n## A little background on Redis evictions\n\nAt the time, our cache was oversubscribed, trying to hold more cache keys than the [configured `maxmemory` threshold](https://redis.io/docs/reference/eviction/) allowed, so evictions from the LRU cache were expected. But the dense concentration of that eviction overhead was surprising and troubling.\n\nRedis is essentially single-threaded. With a few exceptions, the “main” thread does almost all tasks serially, including handling client requests and evictions, among other things. Spending more time on X means there is less remaining time to do Y, so think about queuing behavior as the story unfolds.\n\nWhenever Redis reaches its `maxmemory` threshold, it frees memory by evicting some keys, aiming to do just enough evictions to get back under `maxmemory`.
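\n\nYou can watch this loop from the outside with nothing more than redis-cli (a sketch; the counters are standard `INFO` fields, and the one-second cadence is illustrative):\n\n```shell\n# Sample the eviction counter and memory usage once per second\nwhile true; do\n  redis-cli INFO stats | grep evicted_keys\n  redis-cli INFO memory | grep -E '^(used_memory|maxmemory):'\n  sleep 1\ndone\n```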
However, contrary to expectation, the metrics for memory usage and eviction rate (shown below) indicated that instead of a continuous steady eviction rate, there were abrupt burst events that freed much more memory than expected. After each eviction burst, no evictions occurred until memory usage climbed back up to the `maxmemory` threshold again.\n\n**Graph:** Redis memory usage drops by 300-500 MB during each eviction burst:\n\n![Memory usage repeatedly rises gradually to 64 GB and then abruptly drops](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/01_redis_memory_usage_dips_during_eviction_bursts.png)\n\n**Graph:** Key eviction spikes match the timing and size of the memory usage dips shown above\n\n![Eviction counter shows a large spike each time the previous graph showed a large memory usage drop](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/02_redis_eviction_bursts.png)\n\nThis apparent excess of evictions became the central mystery. Initially, we thought answering that question might reveal a way to smooth the eviction rate, spreading out the overhead and avoiding the latency spikes. Instead, we discovered that these bursts are an interaction effect that we need to avoid, but more on that later.\n\n## Eviction bursts cause CPU saturation\n\nAs shown above, we had found that these latency spikes correlated perfectly with large spikes in the cache’s eviction rate, but we did not yet understand why the evictions were concentrated into bursts that lasted a few seconds and occurred every few minutes.\n\nAs a first step, we wanted to verify a causal relationship between eviction bursts and latency spikes.\n\nTo test this, we used [`perf`](https://www.brendangregg.com/perf.html) to run a CPU sampling profile on the Redis main thread. Then we applied a filter to split that profile, isolating the samples where it was calling the [`performEvictions` function](https://github.com/redis/redis/blob/6.2.6/src/evict.c#L512). Using [`flamescope`](https://github.com/Netflix/flamescope), we can visualize the profile’s CPU usage as a [subsecond offset heat map](https://www.brendangregg.com/HeatMaps/subsecondoffset.html), where each second on the X axis is folded into a column of 20 msec buckets along the Y axis. This visualization style highlights sub-second activity patterns. Comparing these two heat maps confirmed that during an eviction burst, `performEvictions` is starving all other main thread code paths for CPU time.\n\n**Graph:** Redis main thread CPU time, excluding calls to `performEvictions`\n\n![Heat map shows one large gap and two small gaps in an otherwise uniform pattern of 70 percent to 80 percent CPU usage](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/03_heat_map_of_redis_main_thread_during_eviction_burst__excluding_performEvictions.png)\n\n**Graph:** Remainder of the same profile, showing only the calls to `performEvictions`\n\n![This heat map shows the gaps in the previous heat map were CPU time spent performing evictions](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/04_heat_map_of_redis_main_thread_during_eviction_burst__only_performEvictions.png)\n\nThese results confirm that eviction bursts are causing CPU starvation on the main thread, which acts as a throughput bottleneck and increases Redis’s response times.  
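\n\nFor reference, the capture side of that analysis is small enough to sketch. Something along these lines records the main thread (whose TID equals the oldest `redis-server` PID) and produces the text-format stacks that `flamescope` consumes; the output path and durations here are illustrative:\n\n```\n# Sample on-CPU stacks of the Redis main thread at 99 Hz for 60 seconds\n$ sudo perf record -F 99 -g -t $( pgrep -o redis-server ) -- sleep 60\n\n# Fold the samples into text form; flamescope renders this as a\n# subsecond offset heat map and can emit flamegraphs for selected ranges\n$ sudo perf script --header > redis_main_thread.stacks\n```\n\n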
These CPU utilization bursts typically lasted a few seconds, so they were too short-lived to trigger alerts but were still user impacting.\n\nFor context, the following flamegraph shows where `performEvictions` spends its CPU time. There are a few interesting things here, but the most important takeaways are:\n* It gets called synchronously by `processCommand` (which handles all client requests).\n* It handles many of its own deletes. Despite its name, the `dbAsyncDelete` function only delegates deletes to a helper thread under certain conditions which turn out to be rare for this workload.\n\n![Flamegraph of calls to function performEvictions, as described above](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/05_flamegraph_of_redis_main_thread_during_eviction_burst__only_performEvictions.png)\n\nFor more details on this analysis, see the [walkthrough and methodology](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_854745083).\n\n## How fast are individual calls to `performEvictions`?\n\nEach incoming request to Redis is handled by a call to `processCommand`, and it always concludes by calling the `performEvictions` function. That call to `performEvictions` is frequently a no-op, returning immediately after checking that the `maxmemory` threshold has not been breached. But when the threshold is exceeded, it will continue evicting keys until it either reaches its `mem_tofree` goal or exceeds its configured time limit per call.\n\nThe CPU heat maps shown earlier proved that `performEvictions` calls were collectively consuming a large majority of CPU time for up to several seconds.\n\nTo complement that, we also measured the wall clock time of individual calls.\n\nUsing the `funclatency` CLI tool (part of the [BCC suite of BPF tools](https://github.com/iovisor/bcc)), we measured call duration by instrumenting entry and exit from the `performEvictions` function and aggregated those measurements into a [histogram](https://en.wikipedia.org/wiki/Histogram) at 1-second intervals. When no evictions were occurring, the calls were consistently low latency (4-7 usecs/call). This is the no-op case described above (including 2.5 usecs/call of instrumentation overhead). 
But during an eviction burst, the results shift to a bimodal distribution, including a combination of the fast no-op calls along with much slower calls that are actively performing evictions:\n\n```\n$ sudo funclatency-bpfcc --microseconds --timestamp --interval 1 --duration 600 --pid $( pgrep -o redis-server ) '/opt/gitlab/embedded/bin/redis-server:performEvictions'\n...\n23:54:03\n     usecs               : count     distribution\n         0 -> 1          : 0        |                                        |\n         2 -> 3          : 576      |************                            |\n         4 -> 7          : 1896     |****************************************|\n         8 -> 15         : 392      |********                                |\n        16 -> 31         : 84       |*                                       |\n        32 -> 63         : 62       |*                                       |\n        64 -> 127        : 94       |*                                       |\n       128 -> 255        : 182      |***                                     |\n       256 -> 511        : 826      |*****************                       |\n       512 -> 1023       : 750      |***************                         |\n```\n\nThis measurement also directly confirmed and quantified the throughput drop in Redis requests handled per second: The call rate to `performEvictions` (and hence to `processCommand`) dropped to 20% of its norm from before the evictions began, from 25K to 5K calls per second.\n\nThis has a huge impact on clients: New requests are arriving at 5x the rate they are being completed. And crucially, we will see soon that this asymmetry is what drives the eviction burst.\n\nFor more details on this analysis, see the [safety check](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_857869826) for instrumentation overhead and the [results walkthrough](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_857907521). And for more general reference, the BPF instrumentation overhead estimate is based on these [benchmark results](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383).\n\n## Experiment: Can tuning mitigate eviction-driven CPU saturation?\n\nThe analyses so far had shown that evictions were severely starving the Redis main thread for CPU time. There were still important unanswered questions (which we will return to shortly), but this was already enough info to [suggest some experiments](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_859236777) to test potential mitigations:\n* Can we spread out the eviction overhead so it takes longer to reach its goal but consumes a smaller percentage of the main thread’s time?\n* Are evictions freeing more memory than expected due to scheduling a lot of keys to be asynchronously deleted by the [lazyfree mechanism](https://github.com/redis/redis/blob/6.2.6/redis.conf#L1079)? Lazyfree is an optional feature that lets the Redis main thread [delegate to an async helper thread](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_859236777) the expensive task of deleting keys that have more than 64 elements. 
These async evictions do not count immediately towards the eviction loop’s memory goal, so if many keys qualify for lazyfree, this could potentially drive many extra iterations of the eviction loop.\n\nThe [answers](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7172#note_971197943) to both turned out to be no:\n* Reducing `maxmemory-eviction-tenacity` to its minimum setting still did not make `performEvictions` cheap enough to avoid accumulating a request backlog. It did increase response rate, but arrival rate still far exceeded it, so this was not an effective mitigation.\n* Disabling `lazyfree-lazy-eviction` did not prevent the eviction burst from dropping memory usage far below `maxmemory`. Those lazyfrees represent a small percentage of reclaimed memory. This rules out one of the potential explanations for the mystery of excessive memory being freed.\n\nHaving ruled out two potential mitigations and one candidate hypothesis, at this point we return to the pivotal question: Why are several hundred extra megabytes of memory being freed by the end of each eviction burst?\n\n## Why do evictions occur in bursts and free too much memory?\n\nEach round of eviction aims to free just barely enough memory to get back under the `maxmemory` threshold.\n\nWith a steady rate of demand for new memory allocations, the eviction rate should be similarly steady. The rate of arriving cache writes does appear to be steady. So why are evictions happening in dense bursts, rather than smoothly? And why do they reduce memory usage on a scale of hundreds of megabytes rather than hundreds of bytes?\n\nSome potential explanations to explore:\n* Do evictions only end when a large key gets evicted, spontaneously freeing enough memory to skip evictions for a while? No, the memory usage drop is far bigger than the largest keys in the dataset.\n* Do deferred lazyfree evictions cause the eviction loop to overshoot its goal, freeing more memory than intended? No, the above experiment disproved this hypothesis.\n* Is something causing the eviction loop to sometimes calculate an unexpectedly large value for its `mem_tofree` goal? We explore this next. The answer is no, but checking it led to a new insight.\n* Is a feedback loop causing evictions to become somehow self-amplifying? If so, what conditions lead to entering and leaving this state? This turned out to be correct.\n\nThese were all plausible and testable hypotheses, and each would point towards a different solution to the eviction-driven latency problem.\n\nThe first two hypotheses we have already eliminated.\n\nTo test the next two, we built custom BPF instrumentation to peek at the calculation of `mem_tofree` at the start of each call to `performEvictions`.\n\n## Observing the `mem_tofree` calculation with `bpftrace`\n\nThis part of the investigation was a personal favorite and led to a critical realization about the nature of the problem.\n\nAs noted above, our two remaining hypotheses were:\n* an unexpectedly large `mem_tofree` goal\n* a self-amplifying feedback loop\n\nTo differentiate between them, we used [`bpftrace`](https://github.com/iovisor/bpftrace) to instrument the calculation of `mem_tofree`, looking at its input variables and results.\n\nThis set of measurements directly tests the following:\n* Does each call to `performEvictions` aim to free a small amount of memory -- perhaps roughly the size of an average cache entry? 
If `mem_tofree` ever approaches hundreds of megabytes, that would confirm the first hypothesis and reveal what part of the calculation was causing that large value. Otherwise, it rules out the first hypothesis and makes the feedback loop hypothesis more likely.\n* Does the replication buffer size significantly influence `mem_tofree` as a feedback mechanism? Each eviction adds to this buffer, just like normal writes do. If this buffer grows large (possibly partly due to evictions) and then abruptly shrinks (due to the peer consuming it), that would cause a spontaneous large drop in memory usage, ending evictions and instantly reducing memory usage. This is one potential way for evictions to drive a feedback loop.\n\nTo peek at the values of the `mem_tofree` calculation ([script](https://gitlab.com/gitlab-com/gl-infra/scalability/uploads/cab2cd03231f8dd4819f77b44d768cb9/redis_snoop.getMaxmemoryState.sha_25a228b839a93a1395907a03f83e1eee448b0f14.production_thresholds.bt)), we needed to isolate the [correct call from `performEvictions`](https://github.com/redis/redis/blob/6.2.6/src/evict.c#L523) to the [`getMaxmemoryState`](https://github.com/redis/redis/blob/6.2.6/src/evict.c#L374-L407) function and reverse engineer its assembly to find the right instruction and register to instrument for each of the source-level variables that we wanted to capture. From that data, we generated histograms for each of the following variables:\n\n```\nmem_reported = zmalloc_used_memory()        // All used memory tracked by jemalloc\noverhead = freeMemoryGetNotCountedMemory()  // Replication output buffers + AOF buffer\nmem_used = mem_reported - overhead          // Non-exempt used memory\nmem_tofree = mem_used - maxmemory           // Eviction goal\n```\n\n_Caveat:_ Our [custom BPF instrumentation](https://gitlab.com/gitlab-com/gl-infra/scalability/uploads/cab2cd03231f8dd4819f77b44d768cb9/redis_snoop.getMaxmemoryState.sha_25a228b839a93a1395907a03f83e1eee448b0f14.production_thresholds.bt) is specific to this particular build of the `redis-server` binary, since it attaches to virtual addresses that are likely to change the next time Redis is compiled. But the approach generalizes. Treat this as a concrete example of using BPF to inspect source code variables in the middle of a function call without having to rebuild the binary. Because we are peeking at the function’s intermediate state and because the compiler inlined this function call, we needed to do binary analysis to find the correct instrumentation points. In general, peeking at a function’s arguments or return value is easier and more portable, but in this case it would not suffice.\n\nThe results:\n* Ruled out the first hypothesis: Each call to `performEvictions` had a small target value (`mem_tofree` \u003C 2 MB). This means each call to `performEvictions` did a small amount of work. Redis’s mysterious rapid drop in memory usage cannot have been caused by an abnormally large `mem_tofree` target evicting a big batch of keys all at once. Instead, there must be many calls collectively driving down memory usage.\n* The replication output buffers remained consistently small, ruling out one of the potential feedback loop mechanisms.\n* Surprisingly, `mem_tofree` was usually 16 KB to 64 KB, which is larger than a typical cache entry. 
This size discrepancy hints that cache keys may not be the main source of the memory pressure perpetuating the eviction burst once it begins.\n\nAll of the above results were consistent with the feedback loop hypothesis.\n\nIn addition to answering the initial questions, we got a bonus outcome: Concurrently measuring both `mem_tofree` and `mem_used` revealed a crucial new fact – _the memory reclaim is a completely distinct phase from the eviction burst_.\n\nReframing the pathology as exhibiting separate phases for evictions versus memory reclaim led to a series of realizations, described in the next section. From that emerged a coherent hypothesis explaining all the observed properties of the pathology.\n\nFor more details on this analysis, see [methodology notes](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982498636), [build notes](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982499538) supporting the disassembly of the Redis binary, and [initial interpretations](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_977994182).\n\n## Three-phase cycle\n\nWith the above results indicating a distinct separation between the evictions and the memory reclaim, we can now concisely characterize [three phases](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982623949) in the cycle of eviction-driven latency spikes.\n\n**Graph:** Diagram (not to scale) comparing memory and CPU usage to request and response rates during each of the three phases\n\n![Diagram summarizes the text that follows, showing CPU and memory saturate in Phase 2 until request rate drops to match response rate, after which they recover](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/06_3_phase_cycle_of_eviction_bursts.png)\n\nPhase 1: Not saturated (7-15 minutes)\n* Memory usage is below `maxmemory`. No evictions occur during this phase.\n* Memory usage grows organically until reaching `maxmemory`, which starts the next phase.\n\nPhase 2: Saturated memory and CPU (6-8 seconds)\n* When memory usage reaches `maxmemory`, evictions begin.\n* Evictions occur only during this phase, and they occur intermittently and frequently.\n* Demand for memory frequently exceeds free capacity, repeatedly pushing memory usage above `maxmemory`. Throughout this phase, memory usage oscillates close to the `maxmemory` threshold, evicting a small amount of memory at a time, just enough to get back under `maxmemory`.\n\nPhase 3: Rapid memory reclaim (30-60 seconds)\n* No evictions occur during this phase.\n* During this phase, something that had been holding a lot of memory starts quickly and steadily releasing it.\n* Without the overhead of running evictions, CPU time is again spent mostly on handling requests (starting with the backlog that accumulated during Phase 2).\n* Memory usage drops rapidly and steadily. By the time this phase ends, hundreds of megabytes have been freed. 
Afterwards, the cycle restarts with Phase 1.\n\nAt the transition between Phase 2 and Phase 3, evictions abruptly end because memory usage stays below the `maxmemory` threshold.\n\nReaching that transition point where memory pressure becomes negative signals that whatever was driving the memory demand in Phase 2 has started releasing memory faster than it is consuming it, shrinking the footprint it had accumulated during the previous phase.\n\nWhat is this **mystery memory consumer** that bloats its demand during Phase 2 and frees it during Phase 3?\n\n## The mystery revealed\n\n[Modeling the phase transitions](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982651298) gave us some useful constraints that a viable hypothesis must satisfy. The mystery memory consumer must:\n* quickly bloat its footprint to hundreds of megabytes on a timescale of less than 10 seconds (the duration of Phase 2), under conditions triggered by the start of an eviction burst\n* quickly release its accumulated excess on a timescale of just tens of seconds (the duration of Phase 3), under the conditions immediately following an eviction burst\n\n**The answer:** The client input/output buffers meet those constraints to be the mystery memory consumer.\n\nHere is how that hypothesis plays out:\n* During Phase 1 (healthy state), the Redis main thread’s CPU usage is already fairly high. At the start of Phase 2, when evictions begin, the eviction overhead saturates the main thread’s CPU capacity, quickly dropping response rate below the incoming request rate.\n* This throughput mismatch between arrivals and responses **is itself the amplifier** that takes over driving the eviction burst. As the size of that rate gap increases, the proportion of time spent doing evictions also increases.\n* Accumulating a backlog of requests requires memory, and that backlog continues to grow until enough clients are stalled that the arrival rate drops to match the response rate. As clients stall, the arrival rate falls, and with it the memory pressure, eviction rate, and CPU overhead begin to ease.\n* At the equilibrium point when arrival rate falls to match response rate, memory demand is satisfied and evictions stop (ending Phase 2). Without the eviction overhead, more CPU time is available to process the backlog, so response rate increases above request arrival rate. This recovery phase steadily consumes the request backlog, incrementally freeing memory as it goes (Phase 3).\n* Once the backlog is resolved, the arrival and response rates match again. CPU usage is back to its Phase 1 norm, and memory usage has temporarily dropped in proportion to the max size of Phase 2’s request backlog.\n\nWe confirmed this hypothesis via a [latency injection experiment](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_987049036) showing that queuing alone explains the pathology. 
This outcome supports the conclusion that the extra memory demand originates from response rate falling below request arrival rate.\n\n## Remedies: How to avoid entering the eviction burst cycle\n\nNow that we understand the dynamics of the pathology, we can draw confident conclusions about viable solutions.\n\nRedis evictions are only self-amplifying when all of the following conditions are present:\n* **Memory saturation:** Memory usage reaches the `maxmemory` limit, causing evictions to start.\n* **CPU saturation:** The baseline CPU usage by the Redis main thread’s normal workload is close enough to a whole core that the eviction overhead pushes it to saturation. This reduces the response rate below request arrival rate, inducing self-amplification via increased memory demand for request buffering.\n* **Many active clients:** The saturation only lasts as long as request arrival rate exceeds response rate. Stalled clients no longer contribute to that arrival rate, so the saturation lasts longer and has a greater impact if Redis has many active clients still sending requests.\n\nViable remedies include:\n* Avoid memory saturation by any combination of the following to make peak memory usage less than the `maxmemory` limit:\n  * Reduce cache time to live (TTL)\n  * Increase `maxmemory` (and host memory if needed, but watch out for [`numa_balancing` CPU overhead](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1889) on hosts with multiple NUMA nodes)\n  * Adjust client behavior to avoid writing unnecessary cache entries\n  * Split the cache among multiple instances (sharding or functional partitioning, helps avoid both memory and CPU saturation)\n* Avoid CPU saturation by any combination of the following to make peak CPU usage for the workload plus eviction overhead be less than 1 CPU core:\n  * Use the fastest processor available for single-threaded instructions per second\n  * Isolate the redis-server process (particularly its main thread) from any other competing CPU-intensive processes (dedicated host, taskset, cpuset)\n  * Adjust client behavior to avoid unnecessary cache lookups or writes\n  * Split the cache among multiple instances (sharding or functional partitioning, helps avoid both memory and CPU saturation)\n  * Offload work from the Redis main thread (io-threads, lazyfree)\n  * Reduce eviction tenacity (only gives a minor benefit in our experiments)\n\nMore exotic potential remedies could include a new Redis feature. One idea is to exempt ephemeral allocations like client buffers from counting towards the `maxmemory` limit, instead applying that limit only to key storage. Alternatively, we could limit evictions to only consume at most a configurable percentage of the main thread’s time, so that most of its time is still spent on request throughput rather than eviction overhead.\n\nUnfortunately, either of those features would trade one failure mode for another, reducing the risk of eviction-driven CPU saturation while increasing the risk of unbounded memory growth at the process level, which could potentially saturate the host or cgroup and lead to an OOM, or out of memory, kill. 
That trade-off may not be worthwhile, and in any case it is not currently an option.\n\n## Our solution\n\nWe had already exhausted the low-hanging fruit for CPU efficiency, so we focused our attention on avoiding memory saturation.\n\nTo improve the cache’s memory efficiency, we [evaluated](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_990891708) which types of cache keys were using the most space and how much [`IDLETIME`](https://redis.io/commands/object-idletime/) they had accrued since last access. This memory usage profile identified some rarely used cache entries (which waste space), helped inform the TTL, or time to live, tuning by first focusing on keys with a high idle time, and highlighted some useful potential cutpoints for functionally partitioning the cache.\n\nWe [decided](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_1014582669) to concurrently pursue several cache efficiency improvements and opened an [epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/764) for it. The goal was to avoid chronic memory saturation, and the main action items were:\n* Iteratively reduce the cache’s [default TTL](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1854) from 2 weeks to 8 hours (helped a lot!)\n* Switch to [client-side caching](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_1026821730) for certain cache keys (efficiently avoids spending shared cache space on non-shared cache entries)\n* [Partition](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/762) a set of cache keys to a separate Redis instance\n\nThe TTL reduction was the simplest solution and turned out to be a big win. One of our main concerns with TTL reduction was that the additional cache misses could potentially increase workload on other parts of the infrastructure. Some cache misses are more expensive than others, and our metrics are not granular enough to quantify the cost of cache misses per type of cache entry. This concern is why we applied the TTL adjustment incrementally and monitored for SLO violations. Fortunately, our inference was correct: Reducing TTL did not significantly reduce the cache hit rate, and the additional cache misses did not cause noticeable impact to downstream subsystems.\n\nThe TTL reduction turned out to be sufficient to drop memory usage consistently a little below its saturation point.\n\nIncreasing `maxmemory` had initially not been feasible because the original peak memory demand (prior to the efficiency improvements) was expected to be larger than the max size of the VMs we use for Redis. 
However, once we dropped memory demand below saturation, we could confidently [provision headroom](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1868) for future growth and re-enable [saturation alerting](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1883).\n\n## Results\n\nThe following graph shows Redis memory usage transitioning out of its chronically saturated state, with annotations describing the milestones when latency spikes ended and when the saturation margin became wide enough to be considered safe:\n\n![Redis memory usage stops showing a flat top saturation](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/07_epic_results__memory_saturation_avoided_by_TTL_reductions.png)\n\nZooming into the days when we rolled out the TTL adjustments, we can see the harmful eviction-driven latency spikes vanish as we drop memory usage below its saturation point, exactly as predicted:\n\n![Redis memory usage starts as a flat line and then falls below that saturation line](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/08_results__redis_memory_usage_stops_saturating.png)\n\n![Redis response time spikes stop occurring at the exact point when memory stops being saturated](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/09_results__redis_latency_spikes_end.png)\n\nThese eviction-driven latency spikes had been the biggest cause of slowness in the Redis cache.\n\nSolving this source of slowness significantly improved the user experience. This 1-year lookback shows only the long-tail portion of the improvement, not even the full benefit. Each weekday had roughly 2 million Redis requests slower than 1 second, until our fix in mid-August:\n\n![Graph of the daily count of Redis cache requests slower than 1 second, showing roughly 2 million slow requests per day on weekdays until mid-August, when the TTL adjustments were applied](https://about.gitlab.com/images/blogimages/2022-11-28-diagnosing-redis-latency-spikes-with-bpf-and-friends/10_results__1_year_retrospective_of_slow_redis_requests_per_day.png)\n\n## Conclusions\n\nWe solved a long-standing latency problem that had been worsening as the workload grew, and we learned a lot along the way. This article focuses mostly on the Redis discoveries, since those are general behaviors that some of you may encounter in your travels. We also developed some novel tools and analytical methods and uncovered several useful environment-specific facts about our workload, infrastructure, and observability, leading to several additional improvements and proposals not mentioned above.\n\nOverall, we made several efficiency improvements and broke the cycle that was driving the pathology. Memory demand now stays well below the saturation point, eliminating the latency spikes that were burning error budgets for the development teams and causing intermittent slowness for users. All stakeholders are happy, and we came away with deeper domain knowledge and sharper skills!\n\n## Key insights summary\n\nThe following notes summarize what we learned about Redis eviction behavior (current as of version 6.2):\n* The same memory budget (`maxmemory`) is shared by key storage and client connection buffers. 
A spike in demand for client connection buffers counts towards the `maxmemory` limit, in the same way that a spike in key inserts or key size would.\n* Redis performs evictions in the foreground on its main thread. All time spent in `performEvictions` is time not spent handling client requests. Consequently, during an eviction burst, Redis has a lower throughput ceiling.\n* If eviction overhead saturates the main thread’s CPU, then response rate falls below request arrival rate. Redis accumulates a request backlog (which consumes memory), and clients experience this as slowness.\n* The memory used for pending requests requires more evictions, driving the eviction burst until enough clients are stalled that arrival rate falls back to match response rate. At that equilibrium point, evictions stop, eviction overhead vanishes, Redis rapidly handles its request backlog, and that backlog’s memory gets freed.\n* Triggering this cycle requires all of the following:\n  * Redis is configured with a `maxmemory` limit, and its memory demand exceeds that size. This memory saturation causes evictions to begin.\n  * The Redis main thread’s CPU utilization is high enough under its normal workload that having to also perform evictions drives it to CPU saturation. This reduces response rate below request rate, causing a growing request backlog and high latency.\n  * Many active clients are connected. The duration of the eviction burst and the size of memory spent on client connection buffers increase in proportion to the number of active clients.\n* Prevent this cycle by avoiding either memory or CPU saturation. In our case, avoiding memory saturation was easier (mainly by reducing cache TTL).\n\n## Further reading\n\nThe following lists summarize the analytical tools and methods cited in this article. These tools are all highly versatile, and any of them can provide a massive level-up when working on performance engineering problems.\n\nTools:\n* [perf](https://www.brendangregg.com/perf.html) - A Linux performance analysis multitool. In this article, we used `perf` as a sampling profiler, capturing periodic stack traces of the `redis-server` process's main thread when it is actively running on a CPU.\n* [Flamescope](https://github.com/Netflix/flamescope) - A visualization tool for rendering a `perf` profile (and other formats) into an interactive subsecond heat map. This tool invites the user to explore the timeline for microbursts of activity or inactivity and render flamegraphs of those interesting timespans to explore what code paths were active.\n* [BCC](https://github.com/iovisor/bcc) - BCC is a framework for building BPF tools, and it ships with many useful tools out of the box. In this article, we used `funclatency` to measure the call durations of a specific Redis function and render the results as a histogram.\n* [bpftrace](https://github.com/iovisor/bpftrace) - Another BPF framework, ideal for answering ad-hoc questions about your system's behavior. It uses an `awk`-like syntax and is [quick to learn](https://github.com/iovisor/bpftrace#readme). In this article, we wrote a [custom `bpftrace` script](https://gitlab.com/gitlab-com/gl-infra/scalability/uploads/cab2cd03231f8dd4819f77b44d768cb9/redis_snoop.getMaxmemoryState.sha_25a228b839a93a1395907a03f83e1eee448b0f14.production_thresholds.bt) for observing the variables used in computing how much memory to free during each round of evictions. 
This script's instrumentation points are specific to our particular build of `redis-server`, but the [approach generalizes](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982498636) and illustrates how versatile this tool can be.\n\nUsage examples:\n* [Example](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_854745083) - Walkthrough of using `perf` and `flamescope` to capture, filter, and visualize the stack sampling CPU profiles of the Redis main thread.\n* [Example](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_857869826) - Walkthrough (including safety check) of using `funclatency` to measure the durations of the frequent calls to the `performEvictions` function.\n* [Example](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7172#note_971197943) - Experiment for adjusting the Redis settings `lazyfree-lazy-eviction` and `maxmemory-eviction-tenacity` and observing the results using `perf`, `funclatency`, `funcslower`, and the Redis metrics for eviction count and memory usage.\n* [Example](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982498636) - This is a working example (script included) of using `bpftrace` to observe the values of a function's variables. In this case, we inspected the `mem_tofree` calculation at the start of `performEvictions`. Also, these [companion notes](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_982499538) discuss some build-specific considerations.\n* [Example](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1601#note_987049036) - Describes the latency injection experiment (the first of the three ideas). This experiment confirmed that memory demand increases at the predicted rate when we slow the response rate below the request arrival rate, in the same way evictions do. 
This result confirmed that request queuing itself is the source of the memory pressure that amplifies the eviction burst once it begins.\n",[9,755,707],{"slug":1209,"featured":6,"template":688},"how-we-diagnosed-and-resolved-redis-latency-spikes","content:en-us:blog:how-we-diagnosed-and-resolved-redis-latency-spikes.yml","How We Diagnosed And Resolved Redis Latency Spikes","en-us/blog/how-we-diagnosed-and-resolved-redis-latency-spikes.yml","en-us/blog/how-we-diagnosed-and-resolved-redis-latency-spikes",{"_path":1215,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1216,"content":1222,"config":1229,"_id":1231,"_type":13,"title":1232,"_source":15,"_file":1233,"_stem":1234,"_extension":18},"/en-us/blog/how-we-increased-our-release-velocity-with-gitlab",{"title":1217,"description":1218,"ogTitle":1217,"ogDescription":1218,"noIndex":6,"ogImage":1219,"ogUrl":1220,"ogSiteName":672,"ogType":673,"canonicalUrls":1220,"schema":1221},"How we increased our release velocity with GitLab","Learn Evolphin's challenges, reasons for choosing the DevSecOps platform, and our end state following the transition.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668437/Blog/Hero%20Images/faster-cycle-times.jpg","https://about.gitlab.com/blog/how-we-increased-our-release-velocity-with-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we increased our release velocity with GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Rahul Bhargava, CTO, Evolphin\"}],\n        \"datePublished\": \"2022-12-05\",\n      }",{"title":1217,"description":1218,"authors":1223,"heroImage":1219,"date":1225,"body":1226,"category":1227,"tags":1228},[1224],"Rahul Bhargava, CTO, Evolphin","2022-12-05","\nAt Evolphin, we have a remotely distributed software development team creating the [Evolphin Zoom Media Asset Management system](https://evolphin.com/media-asset-management/). Our core R&D team is split across multiple geographies, with staff in India, the U.S., and the Philippines, as well as freelancers around the world. We needed to find new ways to address our team challenges and increase the pace of delivery of our product updates to the Evolphin Zoom suite, in response to our customer needs. This blog outlines our challenges, reasons for choosing GitLab, and our end state, including a 30% to 40% increase in our release velocity, following the transition.\n\n## What is a media asset management system? \n\nWith the increased demand for video content for entertainment, marketing, customer engagement, etc., media asset management systems have become increasingly popular for collaborating, organizing, and archiving rich media assets. \n\nThe assorted camera card types, encoding formats, and publishing demands of social media and other video-on-demand services create a heterogeneous content creation and publishing industry desperate for order. Media asset management systems are a timely answer to the problem of managing and unifying the diverse media assets characteristic of the industry.\n\nAt Evolphin, we’re at the heart of this solution with the Evolphin Zoom Media Asset Management system, an enterprise offering that runs on approximately 4.7 million lines of source code. 
To address the root of the problem, media asset management products like Evolphin Zoom must rapidly evolve - add new or enhance existing features - to meet customers’ ever-changing needs.\n\n## The problem: Slow updates\n\nBefore adopting GitLab, we used Subversion (with TortoiseSVN as the UI) as our source code repository and software version management system. We chose Subversion at the time because we needed an on-premises solution, as cloud-based branch management was not widely adopted in 2012 when we started working on Evolphin Zoom. \n\nOur branching and merging workflow with Subversion was tedious, slow, and complicated. It took us around four to five weeks to manually manage and merge software changes across branches within this system. This meant that releasing each product update took five weeks at the very minimum. \n\n## Our requirement: Better collaboration for branch management\n\nWe needed a more agile solution to remain responsive to our customers' needs in this fast-paced software development environment. \n\nAs we transitioned to a remotely distributed workforce model, we identified a need for a software version management system designed with decentralized teams in mind. We wanted to be able to create a user story for a new feature in one week, test it with beta users the next week, and release it in production the week after. \n\nFor this level of agility, an affordable, open-source software repository with a platform like GitLab seemed the perfect solution.\n\n## Why GitLab?\n\nWith all the necessary tools for software review management and collaboration, GitLab appeared to fit our needs. \n\nThe ability to remotely check changes into a feature branch meant that users could check in a version and trigger a merge request for approval before merging changes from the remote user’s branch into the main software development branch. \n\nAll these features were available under GitLab’s free community version, with a user-friendly, visually appealing UI that eased our transition from on-premises to cloud-based development. \n\n## End-state with GitLab\n\nHere is our workflow in numbers:\n\n| Metric | Value |\n| --- | --- |\n| Total GitLab projects managed | 44 |\n| Total branches | 514 |\n| Total repo size | 10.03 GB |\n| Total users | 33 |\n| Total groups | 15 |\n| MFA enabled | Yes |\n| Number of files | 26,125 text files |\n| Number of unique files | 25,090 unique files |\n| Code | 4,738,187 lines of code |\n| GitLab product plan | Community plan on the cloud |\n\nOur new workflow depends on GitLab as the single source of truth for all our source code, binary dependencies, and DevOps projects. We currently have GitLab integrations with our CI/CD pipeline using Jenkins and our issue-tracking system - JetBrains YouTrack. Besides source code management (SCM), we use code review features frequently. In addition, all our internal docs, requirements gathering, and tips and tricks between developers, DevOps, and QA are shared in the Wiki. All our collaboration happens over GitLab Wikis and SCM. Our developers and DevOps engineers use the same GitLab repo to make it easy to manage source code and build artifacts for deployment.\n\nSince the pandemic started, we have executed several Amazon Web Services (AWS) cloud-based deployments. 
Some of our DevOps projects in GitLab are integrated with AWS CloudFormation stacks/scripts to enable consistent tenant deployments for our cloud customers.\n\n## Impact on Evolphin’s customers\n\nThe biggest transformation we noticed from adopting GitLab was a more seamless, collaborative, and efficient workflow for our R&D teams. \n\nFor example, a bug fix could be implemented in parallel branches by developers, then merged into a pre-production branch for QA. Following the QA review, changes could be pushed to the main production branch for release. \n\nBecause GitLab is open source, we can easily integrate with CI/CD platforms, and the new workflow significantly improved our productivity around feature releases, especially considering our high volume of product updates. With GitLab, we can execute feature releases two to three weeks faster than previously. This includes twice-monthly feature changes and monthly security updates, along with annual major product changes. Overall, our release velocity increased by 30% to 40% just by switching from Subversion to a GitLab-based workflow.\n\n_Rahul Bhargava is the CTO and founder of Evolphin Software._\n","customer-stories",[9,987,108],{"slug":1230,"featured":6,"template":688},"how-we-increased-our-release-velocity-with-gitlab","content:en-us:blog:how-we-increased-our-release-velocity-with-gitlab.yml","How We Increased Our Release Velocity With Gitlab","en-us/blog/how-we-increased-our-release-velocity-with-gitlab.yml","en-us/blog/how-we-increased-our-release-velocity-with-gitlab",{"_path":1236,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1237,"content":1243,"config":1250,"_id":1252,"_type":13,"title":1253,"_source":15,"_file":1254,"_stem":1255,"_extension":18},"/en-us/blog/how-were-building-up-performance-testing-of-gitlab",{"title":1238,"description":1239,"ogTitle":1238,"ogDescription":1239,"noIndex":6,"ogImage":1240,"ogUrl":1241,"ogSiteName":672,"ogType":673,"canonicalUrls":1241,"schema":1242},"How GitLab's QA Team Leverages Performance Testing Tools","We built our open source GitLab Performance tool to evaluate pain points and implement solutions on every GitLab environment.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749681087/Blog/Hero%20Images/performance-server-front.jpg","https://about.gitlab.com/blog/how-were-building-up-performance-testing-of-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How our QA team leverages GitLab’s performance testing tool (and you can too)\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Grant Young\"}],\n        \"datePublished\": \"2020-02-18\",\n      }",{"title":1244,"description":1239,"authors":1245,"heroImage":1240,"date":1246,"body":1247,"category":681,"tags":1248},"How our QA team leverages GitLab’s performance testing tool (and you can too)",[1183],"2020-02-18","\n\nWe’ve set up several initiatives aimed at testing and improving the performance of GitLab, which is why the Quality team built a new tool to test GitLab's performance.\n\nPerformance testing is an involved process and distinct from other testing disciplines. The strategies and tooling in this space are specialized and require dedicated resources to achieve results. When I joined the company and became the first member of this team, the task was to expand our nascent performance efforts to a much larger scale. 
For this, we needed to build out a new tool that we aptly named the [GitLab Performance tool](https://gitlab.com/gitlab-org/quality/performance) (GPT).\n\nWe're happy to announce the general release of [GPT](https://gitlab.com/gitlab-org/quality/performance/-/releases). In this blog post, we'll share how GPT is used to performance test GitLab, and how you can use it as well to test your own environments.\n\nHowever, before we get into what the GPT is, we need to first touch on what we use it with.\n\n## Reference Architectures and test data\n\nIn our experience, the challenging part of performance testing isn’t actually the testing itself, but configuring the right environments and data to test against.\n\nAs such, one of the initiatives we’ve been driving is the design of several [GitLab Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures/index.html#available-reference-architectures) that can handle large numbers of users. We wanted to create these architectures as a way to standardize our recommended configurations to ensure we were presenting customers with options for performant, scalable, and highly available GitLab setups.\n\nTo test these environments meaningfully, we also needed to add realistic data to test against, e.g., large projects with commits and merge requests. As a first iteration, we started with our very own GitLab project.\n\nOnce we got our environments running and configured, we were ready to test them with the GPT.\n\n## What is the GitLab Performance tool (GPT)?\n\nThe GPT can be used to run numerous load tests to verify the performance of any GitLab environment. All that’s required is knowledge of what throughput the intended environment can handle (as requests per second) and to ensure that the environment has the necessary data prepared.\n\nThe GPT is built upon one of the leading tools in the industry, [k6](https://k6.io/). Here are some examples of what the GPT provides:\n\n* A broad test suite that comes out-of-the-box and covers various endpoints across the GitLab product, with the ability to add your own custom tests as desired. [See the latest out-of-the-box test details](https://gitlab.com/gitlab-org/quality/performance/-/wikis/current-test-details) with more being added frequently.\n* [Options](https://gitlab.com/gitlab-org/quality/performance/-/blob/master/docs/k6.md#options) for customizing test runs, such as specifying desired GitLab environment data or defining what throughput to use, with default examples given.\n* [Ability to run multiple tests sequentially as well as be selective about which are chosen](https://gitlab.com/gitlab-org/quality/performance/-/blob/master/docs/k6.md#running-the-tests-with-the-tool).\n* [Enhanced reporting and logging](https://gitlab.com/gitlab-org/quality/performance/-/blob/master/docs/k6.md#running-the-tests-with-the-tool).\n* [Built-in test success thresholds](https://gitlab.com/gitlab-org/quality/performance/-/blob/master/docs/k6.md#test-thresholds) based on [time to first byte](https://en.wikipedia.org/wiki/Time_to_first_byte), throughput achieved, and successful responses.\n\nThe talented team at [Load Impact](https://loadimpact.com/) created [k6](https://k6.io/), which is the core of the GPT. We realized quickly that we didn’t need to reinvent the wheel because k6 met most of our needs: It is written in Go, so it is very performant, and it is open source. 
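\n\nTo give a flavor of the underlying tooling, here is a bare-bones, hypothetical k6 script and run of the sort the GPT builds upon (this is plain k6 rather than one of the GPT's bundled tests, and `ENVIRONMENT_URL` is an assumed variable name):\n\n```\n# Write a minimal k6 test that exercises one GitLab API endpoint\n$ cat > projects_api_test.js \u003C\u003C'EOF'\nimport http from 'k6/http';\nimport { check } from 'k6';\n\nexport default function () {\n  const res = http.get(`${__ENV.ENVIRONMENT_URL}/api/v4/projects?per_page=100`);\n  check(res, { 'status is 200': (r) => r.status === 200 });\n}\nEOF\n\n# Run it with 10 virtual users for 60 seconds\n$ k6 run --vus 10 --duration 60s projects_api_test.js\n```\n\n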
Thanks to the team for not only developing k6, but also for reaching out to collaborate soon after we started.\n\n## How we use GPT\n\nWe use the GPT in several automated [GitLab CI pipelines](/blog/guide-to-ci-cd-pipelines/) for quick feedback on how GitLab is performing. The CI pipelines typically run daily or weekly against our reference architecture environments, which themselves are running on the latest pre-release code. We review the test results as they come in and then investigate any failures. In line with our [Transparency value](https://handbook.gitlab.com/handbook/values/#transparency), we also publish all of the [latest results](https://gitlab.com/gitlab-org/quality/performance/-/wikis/Benchmarks/Latest) for anyone to view on the [GPT wiki](https://gitlab.com/gitlab-org/quality/performance/-/wikis/home).\n\nThe GPT is also used in a comparison test pipeline to see how GitLab’s performance changes in every release cycle. These results are important because they show the whole picture of our performance evolution. The [benchmark comparison results](https://gitlab.com/gitlab-org/quality/performance/-/wikis/Benchmarks/GitLab-Versions) are also available on the [GPT wiki](https://gitlab.com/gitlab-org/quality/performance/-/wikis/home).\n\nBy using the GPT, we’ve been able to identify several performance pain points of GitLab and collaborate with our dev teams to prioritize improvements. The process has been fruitful so far and we’re excited to already see improvements in the performance numbers with each release of GitLab. The 12.6 release, for example, showed [several notable improvements across the board](https://gitlab.com/gitlab-org/quality/performance/-/wikis/Benchmarks/GitLab-Versions#comparisions), one even as high as a 92% reduction! You can see the issues we've raised so far through this work over on our [issue tracker](https://gitlab.com/gitlab-org/gitlab/issues?scope=all&utf8=%E2%9C%93&state=all&label_name[]=Quality%3Aperformance-issues).\n\n## How you can use GPT\n\nWe decided early that we wanted to follow the same open source principles as our main product, so we built the GPT with all users in mind rather than making it a strictly internal tool. So not only do we let others use it, we encourage it! This is beneficial for us and customers, as we receive feedback from diverse viewpoints that we hadn’t considered. Some examples of this are [improving the recommended spec guidelines based on throughput](https://gitlab.com/gitlab-org/quality/performance/issues/172) or [making it easier for users who have private clouds to use the GPT offline](https://gitlab.com/gitlab-org/quality/performance/issues/106).\n\nIf you want to use the GPT for yourself, the best place to start is with its [documentation](https://gitlab.com/gitlab-org/quality/performance#documentation). As mentioned earlier, most of the effort to use the GPT is preparing the intended environment. The docs will take you through this along with how to use the tool itself.\n\n## The GPT in action\n\nFinally, after writing all about the GPT, we should show it in action. 
Here's how it looks when running a load test for the [List Group Projects API](https://docs.gitlab.com/ee/api/groups.html#list-a-groups-projects) against our [10k Reference Architecture](https://docs.gitlab.com/ee/administration/reference_architectures/10k_users.html):\n\n[![asciicast](https://asciinema.org/a/O96Wc5fyxvLb1IDyviTwbujg8.svg)](https://asciinema.org/a/O96Wc5fyxvLb1IDyviTwbujg8?autoplay=1)\n\nRead the GPT [documentation](https://gitlab.com/gitlab-org/quality/performance/blob/master/docs/k6.md#test-output-and-results) for more details on output and results.\n\n## What’s next?\n\nOur aim is to make GitLab’s performance best in class. This is only the start of our performance testing journey with GPT and we are excited about the additional ways we can continue to help improve the customer experience.\n\n[Some examples of our plans for the next few releases](https://gitlab.com/gitlab-org/quality/performance/issues) include expanding test coverage to more of GitLab’s features and entry points (API, Web, Git) and expanding our work on the reference architectures, test data, and user behavior patterns to be as representative and realistic as possible.\n\nShare your feedback and/or suggestions on GPT here or on our [GPT project](https://gitlab.com/gitlab-org/quality/performance)! We welcome your ideas or contributions.\n\nCover image by [Taylor Vick](https://unsplash.com/@tvick?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/s/photos/server?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).\n{: .note}\n",[9,1249],"testing",{"slug":1251,"featured":6,"template":688},"how-were-building-up-performance-testing-of-gitlab","content:en-us:blog:how-were-building-up-performance-testing-of-gitlab.yml","How Were Building Up Performance Testing Of Gitlab","en-us/blog/how-were-building-up-performance-testing-of-gitlab.yml","en-us/blog/how-were-building-up-performance-testing-of-gitlab",{"_path":1257,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1258,"content":1264,"config":1269,"_id":1271,"_type":13,"title":1272,"_source":15,"_file":1273,"_stem":1274,"_extension":18},"/en-us/blog/inside-dora-performers-score-in-gitlab-value-streams-dashboard",{"title":1259,"description":1260,"ogTitle":1259,"ogDescription":1260,"noIndex":6,"ogImage":1261,"ogUrl":1262,"ogSiteName":672,"ogType":673,"canonicalUrls":1262,"schema":1263},"Inside DORA Performers score in GitLab Value Streams Dashboard ","Learn how four key metrics drive DevOps maturity, helping teams optimize workflows and achieve DevOps excellence.\n","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098908/Blog/Hero%20Images/Blog/Hero%20Images/AdobeStock_644947854_248JIrEOCaGJdfJdiSjYde_1750098907747.jpg","https://about.gitlab.com/blog/inside-dora-performers-score-in-gitlab-value-streams-dashboard","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Inside DORA Performers score in GitLab Value Streams Dashboard \",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2024-01-18\",\n      }",{"title":1259,"description":1260,"authors":1265,"heroImage":1261,"date":1266,"body":1267,"category":730,"tags":1268},[818],"2024-01-18","The DevOps Research and Assessment ([DORA](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html)) metrics are industry-standard measurements to help better understand the capabilities that drive software 
delivery and operations performance. GitLab recently added a DORA Performers score panel to the Value Streams Dashboard in the GitLab DevSecOps Platform to visualize the status of the organization's DevOps performance across different projects.\n\nThis new visualization displays a breakdown of the DORA performance levels, designating a score level for each project under a group. Executives can use this visualization to easily identify the highs and lows in DORA scores and understand their organization's DevOps health top to bottom.\n\n> [Try the Value Streams Dashboard today.](https://about.gitlab.com/blog/getting-started-with-value-streams-dashboard/)\n\n## What are DORA metrics?\n\nDuring the past nine years, the DORA team gathered insights from over 36,000 professionals around the globe on how to measure the performance of a software development team. They identified four metrics as key indicators to measure software teams' development effectiveness and efficiency:\n\n- [Deployment frequency](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html#deployment-frequency) and [Lead time for changes](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html#lead-time-for-changes) measure team velocity.\n- [Change failure rate](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html#change-failure-rate) and [Time to restore service](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html#time-to-restore-service) measure stability.\n\nBy analyzing these metrics, teams are able to find areas for improvement, optimize their workflows, and ultimately drive positive business results.\n\nDORA uses these metrics to identify high-performing, medium-performing, and low-performing teams.  These performance levels provide a framework for organizations to assess their DevOps maturity and effectiveness.\n\n![DORA performers](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098929/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750098929143.png)\n\nHigh performance indicates that the team is operating at excellent speed and stability in their software delivery, reaching the peak of DevOps maturity.\n\nMedium and low performance levels suggest opportunities for improvement in different aspects of the software development and delivery process.\n\nLet's take a closer look at the DORA definition for each performance level.\n\n![Chart of performance metrics](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098929/Blog/Content%20Images/Blog/Content%20Images/image2_aHR0cHM6_1750098929144.png)\n\u003Csup>\u003Csub>_Source: [DORA Accelerate State of DevOps report](https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out)_\u003C/sub>\u003C/sup>\u003Cp>\u003C/p>\n\n## GitLab definitions for the DORA score performance levels\n\nDORA metrics are available out of the box in the GitLab DevSecOps platform. To enable the score calculation to operate \"out of the box\" with GitLab, we adjust the scoring rules so they work with the platform's unified data model. 
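\n\nAs an aside, the metrics behind these scores can also be retrieved programmatically through the GitLab REST API; here is a minimal sketch using the documented project-level DORA metrics endpoint (the project ID is a placeholder, and `$GITLAB_TOKEN` is an assumed variable holding an access token):\n\n```\n$ curl --header \"PRIVATE-TOKEN: $GITLAB_TOKEN\" \"https://gitlab.example.com/api/v4/projects/\u003Cproject_id>/dora/metrics?metric=deployment_frequency\"\n```\n\n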
Read more in the [score definition documentation](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html#dora-performers-score-panel).\n\nThe goal is for organizations to strive for high performance in these metrics, as a high score often correlates with better business outcomes, such as increased efficiency, faster time-to-market, and higher software quality.\n\n## DORA metrics in GitLab\n\nIn addition to the Value Streams dashboard, the DORA metrics are also available on the [CI/CD analytics charts](https://docs.gitlab.com/ee/user/analytics/ci_cd_analytics.html), which show the history of DORA metrics over time, and on [Insights reports](https://docs.gitlab.com/ee/user/project/insights/index.html#dora-query-parameters) where you can create custom charts.\n\nWatch our DORA overview video:\n\n\u003C!-- blank line -->\n\u003Cfigure class=\"video_container\">\n \u003Ciframe src=\"https://www.youtube.com/embed/jYQSH4EY6_U?si=sE9rf_X58BGD2uK9\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\u003C!-- blank line -->\n\n## Get started today\nYou can get started with the Value Streams Dashboard by [following the instructions](https://about.gitlab.com/blog/getting-started-with-value-streams-dashboard/) in this documentation. Then, to help us improve the value of the Value Streams Dashboard, please share feedback about your experience in this [brief survey](https://gitlab.fra1.qualtrics.com/jfe/form/SV_50guMGNU2HhLeT4).\n",[707,758,481,781,9],{"slug":1270,"featured":6,"template":688},"inside-dora-performers-score-in-gitlab-value-streams-dashboard","content:en-us:blog:inside-dora-performers-score-in-gitlab-value-streams-dashboard.yml","Inside Dora Performers Score In Gitlab Value Streams Dashboard","en-us/blog/inside-dora-performers-score-in-gitlab-value-streams-dashboard.yml","en-us/blog/inside-dora-performers-score-in-gitlab-value-streams-dashboard",{"_path":1276,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1277,"content":1283,"config":1290,"_id":1292,"_type":13,"title":1293,"_source":15,"_file":1294,"_stem":1295,"_extension":18},"/en-us/blog/inside-look-how-gitlabs-test-platform-team-validates-ai-features",{"title":1278,"description":1279,"ogTitle":1278,"ogDescription":1279,"noIndex":6,"ogImage":1280,"ogUrl":1281,"ogSiteName":672,"ogType":673,"canonicalUrls":1281,"schema":1282},"Inside look: How GitLab's Test Platform team validates AI features","Learn how we continuously analyze AI feature performance, including testing latency worldwide, and get to know our new AI continuous analysis tool.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099033/Blog/Hero%20Images/Blog/Hero%20Images/blog-image-template-1800x945%20%2811%29_78Dav6FR9EGjhebHWuBVan_1750099033422.png","https://about.gitlab.com/blog/inside-look-how-gitlabs-test-platform-team-validates-ai-features","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Inside look: How GitLab's Test Platform team validates AI features\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Mark Lapierre\"},{\"@type\":\"Person\",\"name\":\"Vincy Wilson\"}],\n        \"datePublished\": \"2024-06-03\",\n      }",{"title":1278,"description":1279,"authors":1284,"heroImage":1280,"date":1287,"body":1288,"category":821,"tags":1289},[1285,1286],"Mark Lapierre","Vincy Wilson","2024-06-03","AI is increasingly becoming a centerpiece of software development - many companies are integrating it throughout their DevSecOps 
workflows to improve productivity and increase efficiency. Because of this now-critical role, AI features should be tested and analyzed on an ongoing basis. In this article, we take you behind the scenes to learn how [GitLab's Test Platform team](https://handbook.gitlab.com/handbook/engineering/infrastructure/test-platform/) does this for [GitLab Duo](https://about.gitlab.com/gitlab-duo/) features by conducting performance validation, functional readiness, and continuous analysis across GitLab versions. With this three-pronged approach, GitLab aims to ensure that GitLab Duo features are performing optimally for our customers.\n\n> Discover the future of AI-driven software development with our GitLab 17 virtual launch event. [Watch today!](https://about.gitlab.com/seventeen/)\n\n## AI and testing\n\nAI's non-deterministic nature, where the same input can produce different outputs, makes ensuring a great user experience a challenge. So, when we integrated AI deep into the GitLab DevSecOps Platform, we had to adapt our best practices to address this challenge. \n\nThe [Test Platform team's mission](https://handbook.gitlab.com/handbook/engineering/infrastructure/test-platform/) is to help enable the successful development and deployment of high-quality software applications with continuous analysis and efficiency to help ensure customer satisfaction. The key to achieving this is delivering tools that help increase standardization, repeatability, and test consistency. \n\nApplying this to GitLab Duo, our AI suite of tools to power DevSecOps workflows, means being able to continuously analyze its performance and identify opportunities for improvement. Our goal is to gain clear, actionable insights that will help us to enhance GitLab Duo's capabilities and, as a result, better meet our customers' needs. \n\n## The need for continuous analysis of AI\n\nTo continuously assess GitLab Duo, we needed a mechanism for analyzing feature performance across releases. Therefore, we created an AI continuous analysis tool that automates the collection and analysis of this data. \n\n![diagram of how the AI continuous analysis tool works](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099041/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750099041503.png)\n\n\u003Ccenter>\u003Ci>How the AI continuous analysis tool works\u003C/i>\u003C/center>\n\n### Building the AI continuous analysis tool\n\nTo gain detailed, user-centric insights, we needed to gather data in the appropriate context – in this case, the integrated development environment (IDE), as it is where most of our users access GitLab Duo. We narrowed this down further by opting for the Visual Studio Code IDE, a popular choice within our community. Once the environment was chosen, we automated entering code prompts and recording the provided suggestions. The interactions with the IDE are handled by the [WebdriverIO VSCode service](https://github.com/webdriverio-community/wdio-vscode-service), and CI operations are handled through [GitLab CI/CD](https://docs.gitlab.com/ee/ci/). This automation significantly scaled up data collection and eliminated repetitive tasks for GitLab team members. To start, we have focused on measuring the performance of GitLab Duo Code Suggestions, but plan to expand to other GitLab AI features in the future.\n\n### Analyzing the data\n\nAt the core of our AI continuous analysis tool is a mechanism for collecting and analyzing code suggestions. 
This involves automatically entering code prompts, recording the suggestions provided, and logging timestamps of relevant events. We measure the time from when the tool provides an input until a suggestion is displayed in the UI. In addition, we record the logs created by the IDE, which report the time it took for each suggestion response to be received. With this data, we can compare the latency of suggestions in terms of how long it takes the backend AI service to send a response to the IDE, and how long it takes for the IDE to display the suggestion for the user. We can then compare latency and other metrics of GitLab Duo features across multiple releases. The GitLab platform has the ability to analyze [code quality](https://docs.gitlab.com/ee/ci/testing/code_quality.html) and [application security](https://docs.gitlab.com/ee/user/application_security/), so we leverage these capabilities to enable the AI continuous analysis tool to analyze the quality and security of the suggestions provided by GitLab Duo.\n\n### Improving AI-driven suggestions\n\nOnce the collected data is analyzed, the tool automatically generates a single report summarizing the results. The report includes key statistics (e.g., mean latency and/or latency at various percentiles), descriptions of notable differences or patterns, links to raw data, and CI/CD pipeline logs and artifacts. The tool also records a video of each prompt and suggestion, which allows us to review specific cases where differences are highlighted. This creates an opportunity for the UX researchers and development teams to take action on the insights gained, helping to improve the overall user experience and system performance.\n\nThe tool is at an early stage of development, but it's already helped us to improve the experience for GitLab Duo Code Suggestions users. Moving forward, we plan to expand our tool’s capabilities, incorporate more metrics, and both consume and provide input to our [Centralized Evaluation Framework](https://about.gitlab.com/direction/ai-powered/ai_framework/ai_evaluation/), which validates AI models, to enhance our continuous analysis further.\n\n## Performance validation\n\nAs AI has become integral to GitLab's offerings, optimizing the performance of AI-driven features is essential. Our performance tests aim to evaluate and monitor the performance of our GitLab components, which interact with AI service backends. While we can monitor the performance of these external services as part of our production environment's observability, we cannot control them. Thus, including third-party services in our performance testing would be expensive and yield limited benefits. Although third-party AI providers contribute to overall latency, the latency attributable to GitLab components is still important to check. We aim to detect changes that might lead to performance degradation by monitoring GitLab components. \n\n### Building the AI performance validation test environment\n\nIn our AI test environments, the [AI Gateway](https://docs.gitlab.com/ee/architecture/blueprints/ai_gateway/#summary), which is a stand-alone service that gives GitLab users access to AI features, has been configured to return mocked responses, enabling us to test the performance of AI-powered features without interacting with third-party AI service providers. We conduct AI performance tests on [reference architecture environments of various sizes](https://docs.gitlab.com/ee/administration/reference_architectures/). 
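\nAs a rough illustration of the kind of measurement these tests automate (the real harness is the k6-based [GitLab Performance Tool](https://gitlab.com/gitlab-org/quality/performance), described below), curl's built-in timing variables can report how quickly an endpoint starts responding. The URL here is a placeholder for a test environment, not a real one:\n\n```bash\n# Illustrative only: print time to first byte and total time for one request.\ncurl -s -o /dev/null -w 'TTFB: %{time_starttransfer}s, total: %{time_total}s' 'https://gitlab.example.com/api/v4/projects'\n```\n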
Additionally, we evaluate new tests in their own isolated environment before they're added to the larger environments.\n\n### Testing multi-regional latency\n\nMulti-regional latency tests need to be run from various geolocations to validate that requests are being served from a suitable location close to the source of the request. We do this today using the [GitLab Environment Toolkit](https://gitlab.com/gitlab-org/gitlab-environment-toolkit). The toolkit provisions an environment in the identified region to test (note: both the AI Gateway and the provisioned environment are in the same region), then uses the [GitLab Performance Tool](https://gitlab.com/gitlab-org/quality/performance) to run tests to measure time to first byte (TTFB). TTFB is our way of measuring time to the first part of the response being rendered, which contributes to the perceived latency that a customer experiences. To make this measurement meaningful, our tests include a check to help ensure that the [response itself isn't empty](https://gitlab.com/gitlab-org/quality/performance/-/blob/cee8bef023e590e6ca75828e49f5c7c596581e06/k6/tests/experimental/api_v4_code_suggestions_generation_streaming.js#L70). \n\nOur tests are expanding further to continue to measure perceived latency from a customer’s perspective. We have captured a set of baseline response times that indicate how a specific set of regions performed when the test environment was in a known good state. These baselines allow us to compare subsequent environment updates and other regions to this known state to evaluate the impact of changes. These baseline measurements can be updated after major updates to ensure they stay relevant in the future. \n\nNote: As of this article's publication date, we have AI Gateway deployments across the U.S., Europe, and Asia. To learn more, visit our [handbook page](https://handbook.gitlab.com/handbook/engineering/development/data-science/ai-powered/ai-framework/#-aigw-region-deployments).\n\n## Functionality\n\nTo enable customers to leverage AI with confidence, we must continuously work to ensure our AI features function as expected.\n\n### Unit and integration tests\n\nFeatures that leverage AI models still require rigorous automated tests, which help engineers develop new features and changes confidently. However, since AI features can involve integrating with third-party AI providers, we must be careful to stub any external API calls to help ensure our tests are fast and reliable.\n\nFor a comprehensive look at testing at GitLab, see our [testing standards and style guidelines](https://docs.gitlab.com/ee/development/testing_guide/). \n\n### End-to-end tests \n\nEnd-to-end testing is a strategy for checking whether the application works as expected across the entire software stack and architecture. We've implemented it in two ways for GitLab Duo testing: using real AI-generated responses and mock AI-generated responses.\n\n![validating features - image 2](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750099041/Blog/Content%20Images/Blog/Content%20Images/image2_aHR0cHM6_1750099041504.png)\n\n\u003Ccenter>\u003Ci>End-to-end test workflow\u003C/i>\u003C/center>\n\n#### Using real AI-generated responses\n\nAlthough costly, end-to-end tests are important to help ensure the entire user experience functions as expected. 
Since AI models are non-deterministic, end-to-end test assertions for validating real AI-generated responses should be loose enough to help ensure the feature functions without relying on a response that may change. This might mean an assertion that checks for some response with no errors or for a response we are certain to receive.\n\nAI-driven functionality is not accessible only from within the GitLab application, so we must also consider user workflows for other applications that leverage these features. For example, to cover the use case of a developer requesting code suggestions in [IntelliJ IDEA](https://www.jetbrains.com/idea/) using the GitLab Duo plugin, we need to drive the IntelliJ application to simulate a user workflow. Similarly, to ensure that the GitLab Duo Chat experience is consistent in VS Code, we must drive the VS Code application and exercise the GitLab Workflow extension. Working to ensure these workflows are covered helps us maintain a consistently great developer experience across all GitLab products. \n\n#### Using mock AI-generated responses\n\nIn addition to end-to-end tests using real AI-generated responses, we run some end-to-end tests against test environments configured to return mock responses. This allows us to more frequently verify changes to GitLab code and components that don’t depend on responses generated by an AI model.\n\n> For a closer look at end-to-end testing, read our [end-to-end testing guide](https://docs.gitlab.com/ee/development/testing_guide/end_to_end/). \n\n### Exploratory testing and dogfooding\n\nAI features are built by humans for humans. At GitLab, exploratory testing and dogfooding greatly benefit us. GitLab team members are passionate about what features get shipped, and insights from internal usage are invaluable in shaping the direction of AI features.\n\n[Exploratory testing](https://about.gitlab.com/topics/devops/devops-test-automation/#test-automation-stages) allows the team to creatively exercise features to help ensure edge case bugs are identified and resolved. Dogfooding encourages team members to use AI features in their daily workflows, which helps us identify realistic issues from realistic users. For a comprehensive look at how we dogfood AI features, read [Developing GitLab Duo: How we are dogfooding our AI features](https://about.gitlab.com/blog/developing-gitlab-duo-how-we-are-dogfooding-our-ai-features/).\n\n## Get started with GitLab Duo\nHopefully this article gives you insight into how we are validating AI features at GitLab. We have integrated our team's process into our overall development as we iterate on GitLab Duo features. 
We encourage you to try GitLab Duo in your organization and reap the benefits of AI-powered workflows.\n\n> Start a [free trial of GitLab Duo](https://about.gitlab.com/gitlab-duo/#free-trial) today!\n\n_Members of the GitLab Test Platform team contributed to this article._\n",[759,823,481,754,1249,9],{"slug":1291,"featured":90,"template":688},"inside-look-how-gitlabs-test-platform-team-validates-ai-features","content:en-us:blog:inside-look-how-gitlabs-test-platform-team-validates-ai-features.yml","Inside Look How Gitlabs Test Platform Team Validates Ai Features","en-us/blog/inside-look-how-gitlabs-test-platform-team-validates-ai-features.yml","en-us/blog/inside-look-how-gitlabs-test-platform-team-validates-ai-features",{"_path":1297,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1298,"content":1304,"config":1310,"_id":1312,"_type":13,"title":1313,"_source":15,"_file":1314,"_stem":1315,"_extension":18},"/en-us/blog/installing-gitlab-on-raspberry-pi-64-bit-os",{"title":1299,"description":1300,"ogTitle":1299,"ogDescription":1300,"noIndex":6,"ogImage":1301,"ogUrl":1302,"ogSiteName":672,"ogType":673,"canonicalUrls":1302,"schema":1303},"Installing GitLab on Raspberry Pi 64-bit OS","A Raspberry Pi enthusiast tries to run GitLab on the new 64-bit OS...and here's what happened.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749679433/Blog/Hero%20Images/anto-meneghini-gqytxsrctvw-unsplash.jpg","https://about.gitlab.com/blog/installing-gitlab-on-raspberry-pi-64-bit-os","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Installing GitLab on Raspberry Pi 64-bit OS\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Brendan O'Leary\"}],\n        \"datePublished\": \"2022-03-14\",\n      }",{"title":1299,"description":1300,"authors":1305,"heroImage":1301,"date":1306,"body":1307,"category":681,"tags":1308},[1101],"2022-03-14","\n\n_This blog post and linked pages contain information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes.\nAs with all projects, the items mentioned in this blog post and linked pages are subject to change or delay. The development and release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc._\n\nRecently, the 64-bit version of [Raspberry Pi OS](https://www.raspberrypi.com/software/) came out of a long-awaited beta, and as a Raspberry Pi enthusiast, I was eager to get my hands on it. While the 64-bit version isn't compatible with all Pi hardware, it's exciting to see the expansion of the ecosystem to allow for better access to RAM and software compatibility as 32-bit support becomes less common.\n\nBut speaking of software support - what about running GitLab on the new 64-bit OS? Did you know that GitLab already has support for [Raspberry Pi OS](/install/#raspberry-pi-os)? We even have documentation on [optimizing GitLab on a Raspberry Pi](https://docs.gitlab.com/omnibus/settings/rpi.html) for folks who want to run their self-hosted DevOps platform on simple hardware like the Pi.\n\nNow, the distribution team would want me to point out that official support for ARM64 is still [in the works](https://gitlab.com/groups/gitlab-org/-/epics/2370), but that didn't stop me from at least wanting to try to install GitLab on this exciting new platform. 
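\nOne quick sanity check before following along - this is my own aside, not part of any official guide - is to confirm the Pi is really running the 64-bit OS:\n\n```bash\n# The 64-bit Raspberry Pi OS reports aarch64; the 32-bit OS reports armv7l.\nuname -m\n```\n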
Remember that your mileage may vary - and don't use this in production as it isn't yet officially supported.  \n\nBut that's never stopped me before, so I grabbed my Raspberry Pi 4, a new Micro SD card, and the updated [Raspberry Pi Imager](https://downloads.raspberrypi.org/imager/imager_latest.dmg) and got started.\n\n## Getting Started\n\nThe typical [install for GitLab on the Raspberry Pi](/install/#raspberry-pi-os) assumes you have the 32-bit version of `raspbian/buster` that has been standard for some time. So, following those steps, I ran into an error with the install script.\n\nWhen I ran\n\n```bash\nsudo curl -sS https://packages.gitlab.com/install/repositories/gitlab/raspberry-pi2/script.deb.sh | sudo bash\n```\n\nit appeared to work, but when I tried to install GitLab I'd get this error:\n\n```bash\n$ sudo EXTERNAL_URL=\"https://gitpi.boleary.dev\" apt-get install gitlab-ce\n\nReading package lists... Done\nBuilding dependency tree... Done\nReading state information... Done\nPackage gitlab-ce is not available, but is referred to by another package.\nThis may mean that the package is missing, has been obsoleted, or\nis only available from another source\n \nE: Package 'gitlab-ce' has no installation candidate\n```\nThat's because this specific version of Raspberry Pi OS isn't supported yet - but since it is a fork of Debian Linux, I was able to work around that.\n\n## Manual Installation\n\nTo get started with a slightly modified installation path, I first got the package details and appropriate prerequisite libraries installed:\n\n```bash\ncurl -s https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.deb.sh | sudo bash\n\nsudo apt-get update\n\nsudo apt-get install debian-archive-keyring\n\nsudo apt-get install curl gnupg apt-transport-https\n\ncurl -L https://packages.gitlab.com/gitlab/gitlab-ce/gpgkey | sudo apt-key add -\n```\n\nThen I created a new sources list to point `apt` to for the installation with `sudo touch /etc/apt/sources.list.d/gitlab_gitlab-ce.list`.\n\nNext, I manually added the Debian Buster repositories to that sources list I just created by modifying `/etc/apt/sources.list.d/gitlab_gitlab-ce.list` to add:\n\n```\ndeb https://packages.gitlab.com/gitlab/gitlab-ce/debian/ buster main\ndeb-src https://packages.gitlab.com/gitlab/gitlab-ce/debian/ buster main\n```\n\n## Finishing Up\nFrom there, it was easy to install the 'standard' way, with apt-get handling the rest for me.\n\n```bash\nsudo apt-get update\n\nsudo EXTERNAL_URL=\"http://gitpi.boleary.dev\" apt-get install gitlab-ce\n```\n\n## Next Steps\n\nNow, those who love DNS will notice that I was pointing to a fully qualified domain name - but if you look up that address, it resolves to a private one.\n\n```bash\ndig gitpi.boleary.dev\n; \u003C\u003C>> DiG 9.10.6 \u003C\u003C>> gitpi.boleary.dev\n;; OPT PSEUDOSECTION:\n; EDNS: version: 0, flags:; udp: 512\n;; QUESTION SECTION:\n;gitpi.boleary.dev.\t\tIN\tA\n\n;; ANSWER SECTION:\ngitpi.boleary.dev.\t300\tIN\tA\t100.64.205.40\n```\n\nIsn't that interesting?  What does it mean - can I access it from outside my house's network?  
And how will I get it to work with HTTPS on that private address?\n\nFor those answers, you'll have to stay tuned to my next article about running GitLab on the Raspberry Pi: Hosting a private GitLab server with Tailscale and Let's Encrypt.\n\nPhoto by \u003Ca href=\"https://unsplash.com/@antomeneghini?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText\">Anto Meneghini\u003C/a> on \u003Ca href=\"https://unsplash.com/s/photos/raspberries?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText\">Unsplash\u003C/a>\n  \n",[1309,230,9],"demo",{"slug":1311,"featured":6,"template":688},"installing-gitlab-on-raspberry-pi-64-bit-os","content:en-us:blog:installing-gitlab-on-raspberry-pi-64-bit-os.yml","Installing Gitlab On Raspberry Pi 64 Bit Os","en-us/blog/installing-gitlab-on-raspberry-pi-64-bit-os.yml","en-us/blog/installing-gitlab-on-raspberry-pi-64-bit-os",{"_path":1317,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1318,"content":1324,"config":1329,"_id":1331,"_type":13,"title":1332,"_source":15,"_file":1333,"_stem":1334,"_extension":18},"/en-us/blog/introducing-autoscaling-gitlab-runners-on-aws-fargate",{"title":1319,"description":1320,"ogTitle":1319,"ogDescription":1320,"noIndex":6,"ogImage":1321,"ogUrl":1322,"ogSiteName":672,"ogType":673,"canonicalUrls":1322,"schema":1323},"How autoscaling GitLab CI works on AWS Fargate","Run your CI jobs as AWS Fargate tasks with GitLab Runner and the Fargate Driver","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749681285/Blog/Hero%20Images/runner-autoscale-fargate-blog-cover.jpg","https://about.gitlab.com/blog/introducing-autoscaling-gitlab-runners-on-aws-fargate","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How autoscaling GitLab CI works on AWS Fargate\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Darren Eastman\"}],\n        \"datePublished\": \"2020-05-11\",\n      }",{"title":1319,"description":1320,"authors":1325,"heroImage":1321,"date":1326,"body":1327,"category":681,"tags":1328},[1122],"2020-05-11","\n\nAutoscaling GitLab Runner is a unique value proposition for teams that run their self-managed build agents on cloud-hosted virtual machines. As the number of [CI/CD jobs](/topics/ci-cd/) run over a specific period can fluctuate, teams must have build agent auto-scaling solutions in place that are easy to set up and configure, and that are cost-efficient.  \n\nGitLab Runner [autoscaling](https://docs.gitlab.com/runner/configuration/autoscale.html) responds to demand by provisioning new cloud-hosted virtual machines with Docker and GitLab Runner. When demand is lower, any additional virtual machines above the configured minimum size are de-provisioned. However, while this model of automatically provisioning and terminating virtual machine instances continues to be useful for a wide range of use cases, customers also want to take advantage of the capabilities of cloud container orchestration solutions for executing GitLab CI/CD jobs. For some, adopting GitLab's Kubernetes integration for AWS Elastic Kubernetes Service and Google Kubernetes Engine has allowed them to take advantage of the benefits of containerized pipelines. 
For others, AWS Fargate has proven to be a compelling container orchestration solution, as it simplifies the process of launching and managing containers on the AWS services ECS and EKS.\n\nWe are pleased to announce that as of the [12.10](/releases/2020/04/22/gitlab-12-10-released/) release, you can now auto-scale GitLab CI jobs on AWS Fargate-managed containers.\n\n![](https://about.gitlab.com/images/blogimages/autoscaling-runners-ci-ecs-fargate.png)\n\n## So how does it work? \n\nIn GitLab 12.1, we released the GitLab Runner [Custom executor](https://docs.gitlab.com/runner/executors/custom.html). With the custom executor, you can create drivers for GitLab Runner to execute a job on technology or a platform that is not supported natively. To enable executing GitLab CI jobs on AWS Fargate, we developed a [GitLab AWS Fargate driver](https://gitlab.com/gitlab-org/ci-cd/custom-executor-drivers/fargate) for the Custom executor. This driver uses the AWS Fargate `run-task` action to schedule a new task. A task in ECS is an instance of a task definition that runs the container or containers defined within the task definition. In this containerized solution for CI builds, the pipeline job executes on a container built from an image that must include the tools that you need to build your application.\n\nThe AWS Fargate Driver works in conjunction with GitLab Runner, a lightweight executable that executes pipeline jobs. As with GitLab Runner itself, a `config.toml` file is used to pass configuration parameters to the driver. The AWS Fargate driver divides the CI job into the following stages:\n\n1. Config\n1. Prepare\n1. Run\n1. Cleanup\n\n## SSH connectivity\n\nFor the Fargate Driver to execute build commands in the container that is running as a task on ECS, the driver needs to be able to SSH into the container. So we have built additional capabilities into the driver to allow for an SSH connection between the GitLab Runner + AWS Fargate driver and the CI build container. \n\n![Fargate Driver SSH Connectivity](https://about.gitlab.com/images/blogimages/runner_fargate_driver_ssh.png)\n\n## Limitations\n\nAWS Fargate does not support running containers in privileged mode. For example, Docker-in-Docker (DinD), which enables the building and running of container images inside of containers, does not work on Fargate. In keeping with one of GitLab's core values, iteration, we will continue to iterate on solutions for this problem. So stay tuned for future enhancements.\n\n## Getting Started\n\nTo get started, review our detailed [configuration and setup guide](https://docs.gitlab.com/runner/configuration/runner_autoscale_aws_fargate/index.html).\n\nWith the release of the GitLab Runner AWS Fargate driver, we provide the most diverse set of options in the industry for executing CI pipeline jobs in an autoscaling configuration. These options now include cloud-hosted virtual machines (AWS EC2, Google Cloud, and Azure Compute) and container orchestration platforms (AWS EKS, AWS ECS + Fargate, and Google Kubernetes Engine). 
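\nFor a flavor of how the four driver stages listed above (Config, Prepare, Run, Cleanup) are wired up in practice, here is a sketch of a runner `config.toml` delegating each stage to the Fargate driver binary. The file paths and names are illustrative - treat the setup guide linked above as the authoritative reference:\n\n```toml\n# Sketch only - paths and names are illustrative.\n[[runners]]\n  name = \"fargate-example\"\n  url = \"https://gitlab.com/\"\n  executor = \"custom\"\n  [runners.custom]\n    # Each stage of the CI job is handed to the Fargate driver.\n    config_exec = \"/opt/gitlab-runner/fargate\"\n    config_args = [\"--config\", \"/etc/gitlab-runner/fargate.toml\", \"custom\", \"config\"]\n    prepare_exec = \"/opt/gitlab-runner/fargate\"\n    prepare_args = [\"--config\", \"/etc/gitlab-runner/fargate.toml\", \"custom\", \"prepare\"]\n    run_exec = \"/opt/gitlab-runner/fargate\"\n    run_args = [\"--config\", \"/etc/gitlab-runner/fargate.toml\", \"custom\", \"run\"]\n    cleanup_exec = \"/opt/gitlab-runner/fargate\"\n    cleanup_args = [\"--config\", \"/etc/gitlab-runner/fargate.toml\", \"custom\", \"cleanup\"]\n```\n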
Our long-term goal is to provide the best and most comprehensive solution for executing CI jobs at scale on the major cloud platforms.\n\n\nCover image by [Alessio Lin](https://unsplash.com/@lin_alessio) on [Unsplash](https://www.unsplash.com)\n{: .note}\n",[108,823,9],{"slug":1330,"featured":6,"template":688},"introducing-autoscaling-gitlab-runners-on-aws-fargate","content:en-us:blog:introducing-autoscaling-gitlab-runners-on-aws-fargate.yml","Introducing Autoscaling Gitlab Runners On Aws Fargate","en-us/blog/introducing-autoscaling-gitlab-runners-on-aws-fargate.yml","en-us/blog/introducing-autoscaling-gitlab-runners-on-aws-fargate",{"_path":1336,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1337,"content":1343,"config":1350,"_id":1352,"_type":13,"title":1353,"_source":15,"_file":1354,"_stem":1355,"_extension":18},"/en-us/blog/job-artifact-meta-data-expiration-change",{"title":1338,"description":1339,"ogTitle":1338,"ogDescription":1339,"noIndex":6,"ogImage":1340,"ogUrl":1341,"ogSiteName":672,"ogType":673,"canonicalUrls":1341,"schema":1342},"Artifact and job meta data expiration settings are changing for GitLab.com","Default expiration dates for job meta data and artifacts will change on June 22, 2020. Find out how this benefits all users of GitLab.com","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749666262/Blog/Hero%20Images/default-blog-image.png","https://about.gitlab.com/blog/job-artifact-meta-data-expiration-change","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Artifact and job meta data expiration settings are changing for GitLab.com\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Parker Ennis\"}],\n        \"datePublished\": \"2020-06-18\",\n      }",{"title":1338,"description":1339,"authors":1344,"heroImage":1340,"date":1346,"body":1347,"category":948,"tags":1348},[1345],"Parker Ennis","2020-06-18","\n\nTo help maintain overall stability and performance for all GitLab users, some changes are coming to GitLab.com effective June 22, 2020. 
This will directly impact the default retention period for [older job metadata](https://docs.gitlab.com/ee/administration/settings/continuous_integration.html#archive-jobs) as well as enable a [default expiration policy](https://docs.gitlab.com/ee/user/gitlab_com/index.html#gitlab-cicd) for older job artifacts.\n\n## TL;DR\n\n### **Updating GitLab.com's Job Artifact Expiration Policy**\n\nWhat you need to know:\n* Starting June 22, 2020, new job artifacts will immediately default to a 30-day expiration upon initial creation.\n* Existing job artifacts without an expiration date that were created after October 22, 2019, but before June 22, 2020, will have their expiration date set to 1 year after the initial creation date.\n   * _For example, if the last pipeline job you created and ran successfully was on November 10th, 2019, then the job artifacts produced by that job upon completion would have a default expiration date automatically set to November 10th, 2020._\n* Existing job artifacts without an expiration date that were created before October 22, 2019, will have their expiration date set to April 22, 2021.\n   * _For example, if the last pipeline job you created and ran successfully was on July 10th, 2019, then the job artifacts produced by that job upon completion would have a default expiration date automatically set to April 22nd, 2021, unless you specify a different expiration date._\n\nFor additional details, please see our [GitLab.com CI/CD settings documentation](https://docs.gitlab.com/ee/user/gitlab_com/index.html#gitlab-cicd).\n\n### **Default archive of jobs on GitLab.com set to 3 months**\n\nWhat you need to know:\n* Starting June 22, 2020, jobs older than 12 months will be archived.\n* Starting August 6, 2020, jobs older than 6 months will be archived.\n* Starting September 22, 2020, jobs older than 3 months will be archived. This will be the default setting going forward on GitLab.com.\n* New build data will be archived 3 months after creation starting June 22, 2020.\n* New job metadata will be archived 3 months after creation starting June 22, 2020.\n\nFor additional details, please see our GitLab.com CI/CD settings documentation, linked above.\n\n## Additional context\n\n### What are we doing and why?\n\nThe functionality changes will set build artifacts to expire on GitLab.com after 30 days and archive [build metadata](/blog/building-build-images/) after 3 months. In most cases, you probably aren't utilizing old job artifacts that are sitting in storage and will be set to expire. This same concept applies to old metadata for jobs that aren't getting used and that you may have even forgotten about.\n\n### When do the changes take effect?\n\nNew artifacts will have their expiration date set to 30 days by default starting June 22nd, 2020. Existing artifacts will automatically have their expiration dates set according to the updated policy if they aren't already set by the aforementioned date. The last of those existing artifacts will have an expiration date 12 months out.\n\nAdditionally, build data older than 12 months will be archived on June 22, 2020, and any other data will be gradually set to expire within 3 months of the original creation date over the next 3 months.\n\n### What if I want to keep my artifacts and data? How can I do that?\n\nDon't worry! You can always override artifact expiration with the `expire_in` keyword to keep a job artifact longer than 30 days. 
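\nFor example, a job along these lines in `.gitlab-ci.yml` keeps its artifacts for six months instead of the new 30-day default (the job name, command, and paths are illustrative):\n\n```yaml\nbuild:\n  script:\n    - make build            # placeholder build command\n  artifacts:\n    paths:\n      - dist/               # placeholder artifact path\n    expire_in: 6 months     # overrides the 30-day default\n```\n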
Additionally, artifacts from the [last successful job](https://gitlab.com/gitlab-org/gitlab/-/issues/16267) that were created after September 22, 2020, will not be archived after 30 days.\n\n### What's the benefit to end users?\n\nThese actions will result in improved performance and value for everyone using GitLab. By making these changes, we'll free up a significant amount of space in both the GitLab.com database and on disk, which results in better reliability and performance. It also reduces operational costs for GitLab, which aids us in continuing to provide all users with the smoothest possible user experience far into the future!\n\n",[9,1349,864],"releases",{"slug":1351,"featured":6,"template":688},"job-artifact-meta-data-expiration-change","content:en-us:blog:job-artifact-meta-data-expiration-change.yml","Job Artifact Meta Data Expiration Change","en-us/blog/job-artifact-meta-data-expiration-change.yml","en-us/blog/job-artifact-meta-data-expiration-change",{"_path":1357,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1358,"content":1364,"config":1371,"_id":1373,"_type":13,"title":1374,"_source":15,"_file":1375,"_stem":1376,"_extension":18},"/en-us/blog/less-headaches",{"title":1359,"description":1360,"ogTitle":1359,"ogDescription":1360,"noIndex":6,"ogImage":1361,"ogUrl":1362,"ogSiteName":672,"ogType":673,"canonicalUrls":1362,"schema":1363},"Two DevOps platform superpowers: Visibility and actionability","Migrating to a DevOps platform helps organizations better understand and improve their development lifecycle.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668622/Blog/Hero%20Images/group-rowing-collaboration.jpg","https://about.gitlab.com/blog/less-headaches","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Two DevOps platform superpowers: Visibility and actionability\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sharon Gaudin\"}],\n        \"datePublished\": \"2022-09-26\",\n      }",{"title":1359,"description":1360,"authors":1365,"heroImage":1361,"date":1367,"body":1368,"category":730,"tags":1369},[1366],"Sharon Gaudin","2022-09-26","\nA [DevOps platform](/blog/the-journey-to-a-devops-platform/) deployed as a single application takes DevOps gains to the next level, enabling teams to deliver more value to their organization with fewer headaches. A platform, which includes the ability to plan, develop, test, secure, and operate software, empowers teams to deliver software faster, more efficiently, and more securely. And that [makes the business more competitive and more agile](/blog/the-devops-platform-series-building-a-business-case/).\n\n## DevOps visibility and actionability\n\nA complete DevOps platform gives organizations everything they need to turn ideas into valuable and secure software without the time-consuming and costly headaches that multiple tools and multiple UXes bring. A single, end-to-end platform also gives teams one data store sitting underneath everything they do, and, regardless of the interface they are using, allows them to easily surface insights about developer productivity, workflow efficiency, and DevOps practice adoption.\n\nThere are many benefits to a DevOps platform, including visibility and actionability.\n\n### Gain visibility and context\n\nA DevOps platform enables DevOps teams to see and understand what’s happening in their organization, and provide context for those events. 
With insights that go much deeper than what a simple report or dashboard can offer, DevOps teams can better understand the status of projects, as well as their impact.\n\n### Take action more easily\n\nActionability means users can take that contextual information and quickly and efficiently act on it at the point of understanding. Users can move a project ahead more quickly because they don’t have to wait to have a synchronous conversation or meeting to review the new information.\n\nHere are a few ways that an end-to-end platform provides visibility and actionability.\n\n### Track projects with epics and issues\n\nIn a DevOps platform, users are better able to communicate, plan work, and collaborate by using epics and issues. [Epics](https://docs.gitlab.com/ee/user/group/epics/) are an overview of a project, idea, or workflow. Issues are used to organize and list out what needs to be done to complete the larger goal, to track tasks and work status, or to work on code implementations.\n\nFor instance, if managers want an overview of how multiple projects, programs, or products are progressing, they can get that kind of visibility by checking an epic, which will give them a high-level rollup view of what is being worked on, what has been completed, and what is on schedule or delayed. Users can call up an epic to quickly see what’s been accomplished and what is still under way, and then they can dig deeper into sub-epics and related issues for more information.\n\n[Issues](https://docs.gitlab.com/ee/user/project/issues/) offer details about implementation of specific goals, trace collaboration on that topic, and show which parts of the initiative team members are taking on. Users can also see whether due dates have been met or not. Issues can be used to reassign pieces of work, give updates, make comments or suggestions, and see how the nuts and bolts are being created and moved around.\n\n### Labels help track and search projects\n\nLabels are classification tags, which are often assigned colors and descriptive titles like \"bug\", \"feature request\", or \"docs\" to make them easy to understand. They are used in epics, issues, and merge requests to help users organize their work and ideas. They give users at-a-glance insight into which teams are working on a project, the focus of the work, and where it stands in the development lifecycle. Labels can be added and removed as work progresses to enable better tracking and searching.\n\n### Dashboards help track metrics\n\nDashboards are reporting tools that pull together metrics from multiple tools to create an at-a-glance view of projects, [security issues](/blog/secure-stage-for-appsec/), the health of different environments, or requests coming in for specific departments or teams. DevOps platform users can set up live dashboards to see trends in real time, map processes, and track response times, [errors](/blog/iteration-on-error-tracking/), and deployment speed. Dashboards can also be used to see alert statuses and the effect on specific applications and the business overall.\n\n### Value stream analytics\n\nFor visibility without any customization required, there are [value stream analytics](/blog/gitlab-value-stream-analytics/). This interface automatically pulls in data to show users how long it takes the team to complete each stage in their workflow – across planning, development, deployment, and monitoring. 
This gives developers or product owners – or anyone who wants information on workflow efficiency – [a look at high-level metrics](/solutions/value-stream-management/), like deployment frequency. This information is also actionable: it shows which part of the project is taking the most time and what is holding up progress. Based on this information, the user can suggest changes, like moving milestones or assigning the work to someone new, and enact those changes with just one click.\n\nWith a DevOps platform, teams have end-to-end visibility that is also actionable. When users can find the information they need, with the context they need, and make immediate changes, data becomes actionable. Using a single platform, teams can move projects along more quickly, iterate faster, and create more value and company agility.\n\nCheck out our [Migrating to a DevOps platform eBook](https://page.gitlab.com/migrate-to-devops-guide.html?_gl=1*6p1rz*_ga*MTI3MzMwNjYwMi4xNjYyOTg0OTAw*_ga_ENFH3X7M5Y*MTY2Mzk0NDY1Mi4zOS4xLjE2NjM5NDQ2NjEuMC4wLjA.) for even more useful information about how to complete a successful DevOps platform migration.\n\n",[707,1370,9],"growth",{"slug":1372,"featured":6,"template":688},"less-headaches","content:en-us:blog:less-headaches.yml","Less Headaches","en-us/blog/less-headaches.yml","en-us/blog/less-headaches",{"_path":1378,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1379,"content":1385,"config":1391,"_id":1393,"_type":13,"title":1394,"_source":15,"_file":1395,"_stem":1396,"_extension":18},"/en-us/blog/lessons-in-iteration-from-new-infrastructure-team",{"title":1380,"description":1381,"ogTitle":1380,"ogDescription":1381,"noIndex":6,"ogImage":1382,"ogUrl":1383,"ogSiteName":672,"ogType":673,"canonicalUrls":1383,"schema":1384},"Lessons in iteration from a new team in infrastructure","A new, small team at GitLab discovered that minimum viable change applies to scaling problems too.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749681724/Blog/Hero%20Images/skateboard-iteration.jpg","https://about.gitlab.com/blog/lessons-in-iteration-from-new-infrastructure-team","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Lessons in iteration from a new team in infrastructure\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sean McGivern\"}],\n        \"datePublished\": \"2020-11-09\",\n      }",{"title":1380,"description":1381,"authors":1386,"heroImage":1382,"date":1388,"body":1389,"category":681,"tags":1390},[1387],"Sean McGivern","2020-11-09","\n\nThe [Scalability Team][scalability] has the goal of understanding\npotential scaling bottlenecks in our application. We formed a year ago\nwith one person, and as of early 2020, we are made up of three backend\nengineers, plus one site reliability engineer. We are a\nsort of [program team], so we have a wide remit, and there's only one\nsimilar team at GitLab: our sibling [Delivery Team][delivery]. All of\nthe backend engineers in the team (including me) came from\nworking on product development rather than infrastructure work.\n\n[scalability]: /handbook/engineering/infrastructure/team/scalability/\n[program team]: https://lethain.com/programs-owning-the-unownable/\n[delivery]: /handbook/engineering/infrastructure/team/delivery/\n\nWe recently finished a project where we [investigated our use of\nSidekiq][sidekiq] and made various improvements. 
We decided to continue\nthe same approach of looking at services, and got started on our next\ntarget: Redis. Here are some lessons we took away:\n\n[sidekiq]: /blog/scaling-our-use-of-sidekiq/\n\n## 1. Don't lose sight of what matters most: impact\n\nWe chose to split our work on Redis into three phases:\n\n1. [Visibility][v]: increase visibility into the service.\n2. [Triage][t]: use our increased visibility to look for problems and\n   potential improvements, and triage those.\n3. [Knowledge sharing][ks]: share what we learned with the rest of the\n   Engineering department.\n\n[v]: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/309\n[t]: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/309\n[ks]: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/265\n\n[Iteration] is crucial at GitLab, so much so that we have regular\n[Iteration Office Hours]. On the surface, you could say that we were\niterating here: our issues were small and well-scoped, and we were\ndelivering code to production regularly.\n\n[Iteration]: https://handbook.gitlab.com/handbook/values/#iteration\n[Iteration Office Hours]: /handbook/ceo/#iteration-office-hours\n\nThe problem, as it turned out, was that we were focused so heavily on\nunderstanding the service that we lost track of the [results] we were\ntrying to deliver. Our [values hierarchy] puts results at the top, but\nwe hadn't given the results enough attention. We are a small team that\nneeds to cover a wide area, and we need to deliver _impactful_ changes.\n\n[results]: https://handbook.gitlab.com/handbook/values/#results\n[values hierarchy]: https://handbook.gitlab.com/handbook/values/#hierarchy\n\nThere are some [examples in our handbook][impact] – which we've added as\na result of this project – but we define impact as having a\ndirect effect on the platform, our infrastructure, or our development\nteams. That was what was missing here, because the impact was loaded\ntowards the very end of the project: largely in the knowledge sharing\nsection.\n\n[impact]: /handbook/engineering/infrastructure/team/scalability/#impact\n\nWe spent a long time (several months) improving our visibility, which\ndefinitely has a positive impact on our SREs who spend time\ninvestigating incidents. But we could have delivered this value and more\nin a shorter time period if we had kept clear sights on the impact we\nwanted to have.\n\n## 2. Minimum viable change applies to scaling problems too\n\nWith that framing in mind, it's quite clear that we weren't iterating in\nthe best way. To use a famous example, it's like we'd started building a\ncar by building the wheels, then the chassis, etc. That takes a long\ntime to get something useful. We could have started by [building a\nskateboard]. We didn't have a good sense of what a [minimum viable change](https://handbook.gitlab.com/handbook/values/#minimal-viable-change-mvc)\nwas for our team, so we got it wrong.\n\n[building a skateboard]: https://blog.crisp.se/2016/01/25/henrikkniberg/making-sense-of-mvp\n\n![Building a skateboard iteration](https://about.gitlab.com/images/blogimages/scalability-redis-efficiency-skateboard.png){: .medium.center}\nIllustration by [Henrik Kniberg](https://blog.crisp.se/2016/01/25/henrikkniberg/making-sense-of-mvp)\n{: .note.text-right}\n\nWhat would a minimum viable change look like? When we worked on this project, we\ncovered several topics: adding Redis calls to our standard structured\nlogs, exposing slow log information, and so on. 
With hindsight, the best\nway would probably be to slice the project differently. We could take\nthe three steps above (visibility, triage, knowledge sharing), but\nconsider them all to be necessary for a project on a single topic with a\ntangible goal.\n\nWe did this, with all the impact at the end:\n\n![Working through the first step for all topics, the second step for all topics, and finally having impact in the third step](https://about.gitlab.com/images/blogimages/scalability-redis-efficiency-before.jpg)\n\nBut traveling in the other direction would have been much more\neffective:\n\n![Working through all steps for the first topic, having impact, then starting again at the second topic](https://about.gitlab.com/images/blogimages/scalability-redis-efficiency-after.jpg)\n\nThis leads to a state where:\n\n1. The impact we make is clearer.\n2. We start making an impact sooner.\n3. We can re-assess after every project, and stop early once we have\n   done enough.\n\nThe sooner we have this impact, the sooner we can see the results of\nwhat we've done. It's also good for morale to see these results on a\nregular basis!\n\n## 3. Shape your projects to deliver impact throughout\n\nThe way that we originally structured our work to improve Redis usage made our impact harder to see\nthan it should have been. For example, we [updated our\ndevelopment documentation][dev-docs-update] at the end of the project.\nThis was useful, but it would have been much more useful to backend\nengineers if we'd updated the documentation along the way, so they always had the best information we could give them.\n\n[dev-docs-update]: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/41889\n\nFor a more positive example: in the visibility stage, we created\na couple of issues directly for stage groups to address, rather than\nwaiting for the triage or knowledge sharing stage to do so. One of those\nissues was about [large cache entries for merge request\ndiscussions][mr-cache]. 
By getting this in front of the relevant\ndevelopment team earlier, we were able to\nget the fix scheduled and completed sooner as well.\n\n[mr-cache]: https://gitlab.com/gitlab-org/gitlab/-/issues/225600\n\nRegularly delivering projects with clear impact means that we get\nfeedback earlier (from engineers in Development and Infrastructure, or\nfrom the infrastructure itself), we can cover a wider area in less time,\nand we are happier about the work we're doing.\n\nAs people who went from working directly on user-facing features to\nworking on a property of the system as a whole, we learned that we can\nstill set ourselves an MVC to keep us on the right path, as long as we\nthink carefully about the results we want to achieve.\n\n[Cover image](https://unsplash.com/@viniciusamano?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText) by Shawn Henry on [Unsplash](https://unsplash.com/s/photos/skateboard?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText)\n{: .note}\n",[754,9,732],{"slug":1392,"featured":6,"template":688},"lessons-in-iteration-from-new-infrastructure-team","content:en-us:blog:lessons-in-iteration-from-new-infrastructure-team.yml","Lessons In Iteration From New Infrastructure Team","en-us/blog/lessons-in-iteration-from-new-infrastructure-team.yml","en-us/blog/lessons-in-iteration-from-new-infrastructure-team",{"_path":1398,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1399,"content":1405,"config":1411,"_id":1413,"_type":13,"title":1414,"_source":15,"_file":1415,"_stem":1416,"_extension":18},"/en-us/blog/migrating-to-puma-on-gitlab",{"title":1400,"description":1401,"ogTitle":1400,"ogDescription":1401,"noIndex":6,"ogImage":1402,"ogUrl":1403,"ogSiteName":672,"ogType":673,"canonicalUrls":1403,"schema":1404},"How we migrated application servers from Unicorn to Puma","It's been a long journey but with the release of GitLab 13.0 Puma is our default application server. Here's what we did and learned along the way.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749681413/Blog/Hero%20Images/appserverpuma.jpg","https://about.gitlab.com/blog/migrating-to-puma-on-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we migrated application servers from Unicorn to Puma\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Craig Gomes\"}],\n        \"datePublished\": \"2020-07-08\",\n      }",{"title":1400,"description":1401,"authors":1406,"heroImage":1402,"date":1408,"body":1409,"category":681,"tags":1410},[1407],"Craig Gomes","2020-07-08","\n\nIt’s been years in the making, but our journey to migrate our application servers from Unicorn to Puma is complete. With the GitLab 12.9 release, Puma was running on GitLab.com, and now with 13.0 it is the default application server for everyone. This is the story of how we migrated from Unicorn to Puma and the results we’ve seen.\n\n## A starting point\n\nBoth [Unicorn](https://yhbt.net/unicorn/) and [Puma](https://puma.io) are web servers for Ruby on Rails. The big difference is that Unicorn uses a single-threaded process model, while Puma uses a multithreaded one. \n\nUnicorn has a multi-process, single-threaded architecture to make better use of available CPU cores (processes can run on different cores) and to have stronger fault tolerance (most failures stay isolated in only one process and cannot take down GitLab entirely). 
On startup, the Unicorn ‘main’ process loads a clean Ruby environment with the GitLab application code, and then spawns ‘workers’ which inherit this clean initial environment. The ‘main’ never handles any requests; that is left to the workers. The operating system network stack queues incoming requests and distributes them among the workers.\n\nUnlike Unicorn, Puma can run multiple threads for each worker. Puma can be tuned to run multiple threads and workers to make optimal use of your server and workload. For example, in Puma defining \"N workers\" with 1 thread is essentially equivalent to \"N Unicorn workers.\" In multi-threaded processes, thread safety is critical to ensure proper functionality. We encountered one thread safety issue while migrating to Puma and we'll get to that shortly.\n\n### Technical Descriptions\n\nUnicorn is an HTTP server for Rack applications designed to only serve fast clients on low-latency, high-bandwidth connections and take advantage of features in Unix/Unix-like kernels. Slow clients should only be served by placing a reverse proxy capable of fully buffering both the request and response between Unicorn and slow clients.\n\nPuma is a multi-threaded web server and our replacement for Unicorn. Unlike other Ruby web servers, Puma was built for speed and parallelism. Puma is a small library that provides a very fast and concurrent HTTP 1.1 server for Ruby web applications. It is designed for running Rack apps only.\n\nWhat makes Puma so fast is the careful use of a Ragel extension to provide fast, accurate HTTP 1.1 protocol parsing. This makes the server scream without too many portability issues.\n\n## Why Puma?\n\nWe began early investigations into Puma believing it would help resolve some of our [memory growth issues](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/3700) and help with scalability. By switching from Unicorn's single-threaded process model, we could cut down on the number of processes running and the memory overhead of each of these processes. Ruby processes take up a significant amount of memory. Threads, on the other hand, consume a much smaller amount of memory than workers because they are able to share a significantly larger portion of application memory. When I/O causes a thread to pause, another thread can continue with its application request. In this way, multithreading makes the best use of the available memory and CPU, reducing memory consumption by [approximately 40%](/releases/2020/05/22/gitlab-13-0-released/#reduced-memory-consumption-of-gitlab-with-puma).\n\n## The early appearance of Puma\n\nThe first appearance of Puma in a GitLab issue was in a discussion about using [multithreaded application servers](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/3592), dating back to November 20, 2015. In our spirit of iteration, the first attempt at adding experimental support for Puma followed shortly after with a [merge request](https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/1899) on November 25, 2015. The initial [results](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/3592#note_2805965) indicated a lack of stability and thus did not merit us moving forward with Puma at the time. 
While the push [to improve our memory footprint](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/25421) continued, the efforts to move forward with Puma stalled for a while.\n\n## Experimental development use\n\nIn May 2018, Puma was configured for [experimental development use](https://gitlab.com/gitlab-org/gitlab-development-kit/-/merge_requests/532) in GitLab Rails and [Omnibus](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/2801). Later that year, we added [Puma metrics to Prometheus](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/52769) to track our internal experimental usage of Puma. By early spring of 2019, GitLab moved forward with the creation of the [Memory Team](/blog/why-we-created-the-gitlab-memory-team/), one of whose first identified tasks was to deploy Puma to GitLab.com.\n\n## Implementation steps\n\nThe efforts to implement Puma on GitLab.com and for our self-managed customers started in earnest in early 2019 with the [Enable Puma Web Server for GitLab](https://gitlab.com/groups/gitlab-org/-/epics/954) epic and the creation of the Memory Team. One of the early steps we took was to [enable Puma by default in the GDK](https://gitlab.com/gitlab-org/gitlab-development-kit/-/issues/490) to get metrics and feedback from the community and our customers while we worked to deploy on GitLab.com.\n\nThe ability to measure the improvements achieved by the Puma deployment was critical to determining whether we had achieved our goals of overall memory reduction. To capture these metrics, we set up [two identical environments](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/62877) to test changes on a daily basis. This allowed us to quickly change the worker/thread ratio within Puma and review the impact of those changes.\n\n### A rollout plan\n\nWe have multiple pre-production environments, and we followed a progression of deploying Puma through each of these stages (dev->ops->staging->canary->production). Within each stage, we would deploy the changes to enable Puma and test them. Once we confirmed a successful deployment, we would measure and make configuration changes for optimal performance and memory reduction.\n\n### Issues and Tuning\n\nEarly on we determined that our usage of [ChronicDuration](https://gitlab.com/gitlab-org/gitlab/-/issues/31285) was not thread-safe. We ended up [forking the code](https://gitlab.com/gitlab-org/gitlab/-/issues/31285#note_215961555) and distributing our own [gitlab-chronic-duration](https://gitlab.com/gitlab-org/gitlab-chronic-duration) to solve our thread-safety issues.\n\nWe encountered only minor issues in the earlier environments, but once we deployed to Canary, our infrastructure team reported some [unacceptable latency issues](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7455#note_239070865). We spent a significant amount of time tuning [Puma](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8334) for the optimal configuration of workers to threads. We also discovered some changes required to our [health-check endpoint](https://gitlab.com/gitlab-org/omnibus-gitlab/issues/4835) to ensure minimal to no downtime during upgrades.\n\n### Puma Upstream Patch\n\nAs we zeroed in on tuning GitLab.com with Puma, we discovered that capacity was not being evenly distributed. Puma capacity is calculated by `workers * threads`, so if you have 2 workers and 2 threads, you have a capacity of 4. 
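\n\nAs a concrete illustration (the values below are ours for the example, not GitLab's production settings), Puma's `-w` flag sets the worker count and `-t min:max` sets the thread pool per worker, so this hypothetical invocation yields the capacity of 4 described above:\n\n```bash\n# Hypothetical example: 2 workers, each running up to 2 threads,\n# gives a request-handling capacity of 2 * 2 = 4.\npuma -w 2 -t 2:2 config.ru\n```\n\n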
Since Puma uses round-robin to schedule requests, and no other criteria, we saw evidence of some workers being saturated while others sat idle. The simple [fix](https://github.com/puma/puma/pull/2079/files) proposed by [Kamil Trzcinski](https://gitlab.com/ayufan) was to make Puma inject a minimal amount of latency between requests if the worker is already processing requests. This allows other (idle) workers to accept connections from the socket much faster than a worker that is already processing other traffic.\n\nYou can read more details about the discovery and research [here](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8334#note_247859173).\n\n## Our results\n\nOnce we deployed Puma to our entire web fleet, we observed a drop in memory usage from 1.28TB to approximately 800GB (roughly a 37% drop), while our request queuing, request duration, and CPU usage all remained roughly the same.\n\nMore details and graphs can be found [here](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1684#note_291225063). \n\nPuma is now on by default for all GitLab customers in the [GitLab 13.0 release](/releases/2020/05/22/gitlab-13-0-released/).\n\n## What's next\n\nWe want to review our infrastructure needs! The efficiency gains brought about by deploying Puma will allow us to re-examine the memory needs of Rails nodes in production. \n\nAlso, Puma has enabled us to continue to pursue our efforts to enable [real time editing](https://gitlab.com/groups/gitlab-org/-/epics/52). \n\n**More about GitLab's infrastructure:**\n\n[How we scaled Sidekiq](/blog/scaling-our-use-of-sidekiq/)\n\n[Make your pipelines more flexible](/blog/directed-acyclic-graph/)\n\n[The inside scoop on the building of our Status Page](/blog/how-we-built-status-page-mvc/)\n\nCover image by [John Moeses Bauan](https://unsplash.com/@johnmoeses) on [Unsplash](https://www.unsplash.com)\n{: .note}\n",[754,823,9],{"slug":1412,"featured":6,"template":688},"migrating-to-puma-on-gitlab","content:en-us:blog:migrating-to-puma-on-gitlab.yml","Migrating To Puma On Gitlab","en-us/blog/migrating-to-puma-on-gitlab.yml","en-us/blog/migrating-to-puma-on-gitlab",{"_path":1418,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1419,"content":1425,"config":1431,"_id":1433,"_type":13,"title":1434,"_source":15,"_file":1435,"_stem":1436,"_extension":18},"/en-us/blog/monitor-application-performance-with-distributed-tracing",{"title":1420,"description":1421,"ogTitle":1420,"ogDescription":1421,"noIndex":6,"ogImage":1422,"ogUrl":1423,"ogSiteName":672,"ogType":673,"canonicalUrls":1423,"schema":1424},"Monitor application performance with Distributed Tracing","Learn how Distributed Tracing helps troubleshoot application performance issues by providing end-to-end visibility and seamless collaboration across your organization.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098000/Blog/Hero%20Images/Blog/Hero%20Images/blog-image-template-1800x945%20%288%29_5x6kH5vwjz8cwKgSBh1w11_1750098000511.png","https://about.gitlab.com/blog/monitor-application-performance-with-distributed-tracing","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Monitor application performance with Distributed Tracing\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sacha Guyon\"}],\n        \"datePublished\": \"2024-06-13\",\n      
}",{"title":1420,"description":1421,"authors":1426,"heroImage":1422,"date":1428,"body":1429,"category":781,"tags":1430},[1427],"Sacha Guyon","2024-06-13","Downtime due to application defects or performance issues can have devastating financial consequences for businesses. An hour of downtime is estimated to cost firms $301,000 or more, according to [Information Technology Intelligence Consulting's 2022 Global Server Hardware and Server OS Reliability Survey](https://itic-corp.com/server-and-application-by-the-numbers-understanding-the-nines/). These issues often originate from human-introduced changes, such as code or configuration changes.\n\nResolving such incidents requires development and operations teams to collaborate closely, investigating the various components of the system to find the root cause change, and promptly restore the system back to normal operation. However, these teams commonly use separate tools to build, manage, and monitor their application services and infrastructure. This approach leads to siloed data, fragmented communication, and inefficient context switching, increasing the time spent to detect and resolve incidents.\n\nGitLab aims to address this challenge by combining software delivery and monitoring functionalities within the same platform. Last year, we released [Error Tracking](https://docs.gitlab.com/ee/operations/error_tracking.html) as a general availability feature in [GitLab 16.0](https://about.gitlab.com/releases/2023/05/22/gitlab-16-0-released/#error-tracking-is-now-generally-available). Now, we're excited to announce the [Beta release of Distributed Tracing](https://docs.gitlab.com/ee/operations/tracing), the next step toward a comprehensive observability offering seamlessly integrated into the GitLab DevSecOps platform.\n\n## A new era of efficiency: GitLab Observability\n\nGitLab Observability empowers development and operations teams to visualize and analyze errors, traces, logs, and metrics from their applications and infrastructure. By integrating application performance monitoring into existing software delivery workflows, context switching is minimized and productivity is increased, keeping teams focused and collaborative on a unified platform.\n\nAdditionally, GitLab Observability bridges the gap between development and operations by providing insights into application performance in production. This enhances transparency, information sharing, and communication between teams. Consequently, they can detect and resolve bugs and performance issues arising from new code or configuration changes sooner and more effectively, preventing those issues from escalating into major incidents that could negatively impact the business.\n\n## What is Distributed Tracing?\n\nWith Distributed Tracing, engineers can identify the source of application performance issues. A trace represents a single user request that moves through different services and systems. Engineers are able to analyze the timing of each operation and any errors as they occur.\n\nEach trace is composed of one or more spans, which represent individual operations or units of work. Spans contain metadata like the name, timestamps, status, and relevant tags or logs. 
By examining the relationships between spans, developers can understand the request flow, identify performance bottlenecks, and pinpoint issues.\n\nDistributed Tracing is especially valuable for [microservices architecture](https://about.gitlab.com/topics/microservices/), where a single request may involve numerous service calls across a complex system. Tracing provides visibility into this interaction, empowering teams to quickly diagnose and resolve problems.\n\n![tracing example](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098009/Blog/Content%20Images/Blog/Content%20Images/image4_aHR0cHM6_1750098009139.png)\n\nFor example, this trace illustrates how a user request flows through different services to fetch product recommendations on an e-commerce website:\n\n- `User Action`: This indicates the user's initial action, such as clicking a button to request product recommendations on a product page.\n- `Web front-end`: The web front-end sends a request to the recommendation service to retrieve product recommendations.\n- `Recommendation service`: The request from the web front-end is handled by the recommendation service, which processes the request to generate a list of recommended products.\n- `Catalog service`: The recommendation service calls the catalog service to fetch details of the recommended products. An alert icon suggests an issue or delay at this stage, such as a slow response or error in fetching product details.\n- `Database`: The catalog service queries the database to retrieve the actual product details. This span shows the SQL query in the database.\n\nBy visualizing this end-to-end trace, developers can identify performance issues – here, an error in the Catalog service – and quickly diagnose and resolve them across the distributed system.\n\n![End-to-end trace](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098009/Blog/Content%20Images/Blog/Content%20Images/image1_aHR0cHM6_1750098009140.png)\n\n## How Distributed Tracing works\n\nHere is a breakdown of how Distributed Tracing works.\n\n### Collect data from any application with OpenTelemetry\n\nTraces and spans can be collected using [OpenTelemetry](https://opentelemetry.io/docs/what-is-opentelemetry/), an open-source observability framework that supports a wide array of SDKs and libraries across [major programming languages and frameworks](https://opentelemetry.io/docs/languages/). This framework offers a vendor-neutral approach for collecting and exporting telemetry data, enabling developers to avoid vendor lock-in and choose the tools that best fit their needs.\n\nThis means that if you are already using OpenTelemetry with another vendor, you can send data to us simply by adding our endpoint to your configuration file, making it very easy to try out our features!\n\n![Distributed tracing workflow diagram](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098009/Blog/Content%20Images/Blog/Content%20Images/image5_aHR0cHM6_1750098009141.png)\n\n### Ingest and retain data at scale with fast, real-time queries\n\nObservability requires the storage and querying of vast amounts of data while maintaining low latency for real-time analytics. To meet these needs, we developed a horizontally scalable, long-term storage solution using ClickHouse and Kubernetes, based on our [acquisition of Opstrace](https://about.gitlab.com/press/releases/2021-12-14-gitlab-acquires-opstrace-to-expand-its-devops-platform-with-open-source-observability-solution/). 
This [open-source platform](https://gitlab.com/gitlab-org/opstrace/opstrace) ensures rapid query performance and enterprise-grade scalability, all while minimizing costs.\n\n### Explore and analyze traces effortlessly\nAn advanced, native-level user interface is crucial for effective data exploration. We built such an interface from the ground up, starting with our Trace Explorer, which allows users to examine traces and understand their application's performance:\n- __Advanced filtering:__ Filter by services, operation names, status, and time range. Autocomplete helps simplify querying.\n- __Error highlighting:__ Easily identify error spans in search results.\n- __RED metrics:__ Visualize the Requests rate, Errors rate, and average Duration as a time-series chart for any search in real time.\n- __Timeline view:__ Individual traces are displayed as a waterfall diagram, providing a complete view of a request distributed across different services and operations.\n- __Historical data:__ Users can query traces up to 30 days in the past.\n\n![Distributed Tracing - image 5](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098009/Blog/Content%20Images/Blog/Content%20Images/image3_aHR0cHM6_1750098009141.png)\n\n## How we use Distributed Tracing at GitLab\n[Dogfooding](https://handbook.gitlab.com/handbook/values/#dogfooding) is a core value and practice at GitLab. We've already been using early versions of Distributed Tracing for our engineering and operations needs. Here are a couple of example use cases from our teams:\n\n### 1. Debug errors and performance issues in GitLab Agent for Kubernetes\n\nThe [Environments group](https://handbook.gitlab.com/handbook/engineering/development/ops/deploy/environments/) has been using Distributed Tracing to troubleshoot and resolve issues with the [GitLab Agent for Kubernetes](https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent), such as timeouts or high latency. The Trace List and Trace Timeline views offer valuable insights for the team to address these concerns efficiently. These traces are shared and discussed in the [related GitLab issues](https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/issues/386#note_1576431796), where the team collaborates on resolution.\n\n\u003Ccenter>\u003Ci>\"The Distributed Tracing feature has been invaluable in pinpointing where latency issues are occurring, allowing us to focus on the root cause and resolve it faster.\" - Mikhail, GitLab Engineer\u003C/i>\u003C/center>\u003Cp>\n\n### 2. Optimize GitLab’s build pipeline duration by identifying performance bottlenecks\n\nSlow deployments of GitLab source code can significantly impact the productivity of the whole company, as well as our compute spending. Our main repository runs [over 100,000 pipelines every month](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts). If the time it takes for these pipelines to run changes by just one minute, it can add or remove more than 2,000 hours of work time. That's 87 extra days!\n\nTo optimize pipeline execution time, GitLab's [platform engineering teams](https://handbook.gitlab.com/handbook/engineering/infrastructure/) utilize a [custom-built tool](https://gitlab.com/gitlab-com/gl-infra/gitlab-pipeline-trace) that converts GitLab deployment pipelines into traces.\n\nThe Trace Timeline view allows them to visualize the detailed execution timeline of complex pipelines and pinpoint which jobs are part of the critical path and are slowing down the entire process. 
By identifying these bottlenecks, they can optimize job execution – for example, making the job fail faster, or running more jobs in parallel – to improve overall pipeline efficiency.\n\n![Distributed Tracing - image 6](https://res.cloudinary.com/about-gitlab-com/image/upload/v1750098009/Blog/Content%20Images/Blog/Content%20Images/image2_aHR0cHM6_1750098009143.gif)\n\n[The script is freely available](https://gitlab.com/gitlab-com/gl-infra/gitlab-pipeline-trace), so you can adapt it for your own pipelines.\n\n\u003Ccenter>\u003Ci>\"Using Distributed Tracing for our deployment pipelines has been a game-changer. It's helped us quickly identify and eliminate bottlenecks, significantly reducing our deployment times.\"- Reuben, GitLab Engineer\u003C/i>\u003C/center>\u003Cp>\n\n## What's coming next?\n\nThis release is just the start: In the next few months, we'll continue to expand our observability and monitoring features with the upcoming Metrics and Logging releases. Check out [our Observability direction page](https://about.gitlab.com/direction/monitor/platform-insights/) for more info, and keep an eye out for updates!\n\n## Join the private Beta\n\nInterested in being part of this exciting journey? [Sign up to enroll in the private Beta](https://docs.gitlab.com/operations/observability/) and try out our features. Your contribution can help shape the future of observability within GitLab, ensuring our tools are perfectly aligned with your needs and challenges.\n\n> Help shape the future of GitLab Observability. [Join the Distributed Tracing Beta.](https://docs.gitlab.com/operations/observability/)",[9,823,948,481,987],{"slug":1432,"featured":90,"template":688},"monitor-application-performance-with-distributed-tracing","content:en-us:blog:monitor-application-performance-with-distributed-tracing.yml","Monitor Application Performance With Distributed Tracing","en-us/blog/monitor-application-performance-with-distributed-tracing.yml","en-us/blog/monitor-application-performance-with-distributed-tracing",{"_path":1438,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1439,"content":1445,"config":1451,"_id":1453,"_type":13,"title":1454,"_source":15,"_file":1455,"_stem":1456,"_extension":18},"/en-us/blog/monitoring-team-update",{"title":1440,"description":1441,"ogTitle":1440,"ogDescription":1441,"noIndex":6,"ogImage":1442,"ogUrl":1443,"ogSiteName":672,"ogType":673,"canonicalUrls":1443,"schema":1444},"How we plan to build more observability tools on GitLab monitoring","Get the scoop on our plan to close the DevOps loop.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749665484/Blog/Hero%20Images/monitoring-update-feature-image.jpg","https://about.gitlab.com/blog/monitoring-team-update","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we plan to build more observability tools on GitLab monitoring\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sara Kassabian\"}],\n        \"datePublished\": \"2019-08-29\",\n      }",{"title":1440,"description":1441,"authors":1446,"heroImage":1442,"date":1448,"body":1449,"category":298,"tags":1450},[1447],"Sara Kassabian","2019-08-29","\nThe product team at GitLab is working to close the DevOps loop by accelerating development\non new monitoring products that will offer more observability into application performance and\nthe health of your deployments.\n\n## Where does monitoring fit into the DevOps lifecycle?\n\n[Monitoring is the final Ops stage of the 
DevOps loop](/direction/monitor/), coming up after the production environment is configured and the application deployed. No developer should ship code and then forget about it. Monitoring is essential to proactively respond to simple and complex problems, and it helps GitLab customers uphold the expectations outlined in their service level objectives (SLOs) with their users.\n\n## Our vision for monitoring at GitLab\n\nWe outlined big plans for [building out our Ops capabilities](/blog/gitlabs-2018-product-vision/) in our 2018 GitLab product vision: “A big milestone for GitLab will be when operations people log into GitLab every day and consider it their main interface for getting work done.”\n\nSince then, GitLab has been working diligently to build out our monitoring products to close the DevOps loop. The goal is to build instrumentation that allows developers to proactively identify SLO degradation and observe the impacts of code changes across multiple deployments in real time. The \"North Stars\" that guide product development in the monitoring stage include:\n\n*   **Instrument with ease**: GitLab is set up so teams have generic observability into their application performance.\n*   **Resolve like a pro**: GitLab correlates incoming observability data with CI/CD events and source code information so troubleshooting is easy.\n*   **Gain insights seamlessly**: Our use of container-based deployments makes it simpler to continuously collect insights into production SLOs, incidents, and observability sources across complex projects and multiple applications.\n\nOne of our [core principles at GitLab is to dogfood everything](/direction/monitor/#dogfooding) — after all, if it doesn’t work for us, how can it work for our customers? We begin by setting up our own infrastructure teams at GitLab.com [to use the incident management system](https://gitlab.com/groups/gitlab-org/-/epics/1672) we’re developing, and by building out GitLab self-monitoring so our administrators can monitor their self-managed GitLab instance the same way their developers use GitLab to monitor their applications.\n\nWe are also committed to closing the DevOps loop by prioritizing cloud native first, and building tooling designed to provide more insight into application performance and the health of deployments for Ops professionals.\n\n[Kenny Johnston](/company/team/#kencjohnston), director of product (Ops) at GitLab, gave me an overview of some of the new products the monitoring team is working on to help make this vision a reality. 
Watch the full video of our conversation below and check out the [monitoring product roadmap](https://gitlab.com/groups/gitlab-org/-/roadmap?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=devops%3A%3Amonitor) for an in-depth look at our goals and timeline.\n\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube.com/embed/VFju_3R0hPg\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\n## Building an observability suite to close out the DevOps loop\n\nThe top priority for the monitoring team is to close the DevOps feedback loop for GitLab customers. This means that if SLOs are degraded in any way, an alert is triggered and an incident is created in GitLab, allowing for an immediate response.\n\nOur priority product categories at this stage are metrics, cluster monitoring, and incident management, says Kenny.\n\n“First I want to make sure that we can provide our customers with the instrumentation so that they can define an SLO, and when their application exceeds or fails to achieve that SLO, that they can respond in an instant,” says Kenny. “Once we have them doing that, we'll get a lot of good feedback, and immediate feedback from users about what tools they need for diagnostic purposes.”\n\n## Measure your performance with enhanced metrics\n\nWe already have a [successful integration](https://docs.gitlab.com/ee/user/project/integrations/prometheus.html) with the open source metrics tool Prometheus, which we use to collect and display performance metrics for applications deployed on Kubernetes. The integration is sophisticated enough that developers do not have to leave GitLab to collect important information on the impact of a merge request or to monitor production systems. Our product category for metrics is “viable,” meaning customers are using the instrumentation we’ve developed to solve real problems, bringing us a step closer to closing out the DevOps loop.\n\nDiagnostic tooling in product categories such as logging, tracing, and error tracking for application performance monitoring (APM) is currently at the MVC stage, though the team has made plans to [accelerate development on logging](https://youtu.be/nB5KDY4nsFg) in future GitLab deployments.\n\nKenny notes that our observability suite is one of the primary ways GitLab provides value for operators that are thinking of making the move to cloud native.\n\n“GitLab out-of-the-box keeps up with new cloud native technologies because we're constantly adopting the newest versions, and our whole convention of configuration means we don't leave it to you to figure it out, we've figured it out for you as a default,” explains Kenny.\n\n## Simplify Kubernetes management using GitLab\n\nThere is quite a bit of overlap between the metrics and cluster monitoring product categories at this stage, as Prometheus is used to collect metrics on applications deployed using Kubernetes. By offering out-of-the-box cluster monitoring on Kubernetes, we make it possible for operators to monitor the health of their deployed environments all in one place.\n\nOne of the [high-value cluster monitoring features](https://docs.gitlab.com/ee/user/project/clusters/#monitoring-your-kubernetes-cluster) we’ve set up on GitLab is the administration of memory usage and capacity (CPU) metrics, so users can be automatically alerted if either of those numbers is out of bounds on their deployed environments.\n\n“We'd like to start adding capabilities for [cluster cost optimization](https://gitlab.com/gitlab-org/gitlab-ee/issues/11879), so informing users not just when they're hitting capacity but when they're significantly under capacity and should probably size down,” says Kenny. “That helps users who've configured a Kubernetes cluster to not end up wasting it because it's being underutilized and not end up wasting money.”\n\nCluster monitoring was brought to the “viable” stage in earlier GitLab releases as we transitioned to Kubernetes, but the [product team is building out alerting](https://gitlab.com/gitlab-org/gitlab-ee/issues/5456) and other cluster monitoring features in upcoming releases.\n\n## Dogfooding our new incident management system on GitLab\n\nCreating an incident management system is key to a robust observability suite on monitoring: “The features we've prioritized are oriented towards getting the right person the right information to enable them to restore the services they are responsible for as quickly as possible,” according to the [category vision for an incident management system](/direction/service_management/incident_management/).\n\nBecause we recognize the urgency of building a functional incident management system, [GitLab is leveraging issues](/direction/service_management/incident_management/index.html#high-level-design) as the base for creating a viable platform. The goal is to stress the capacity of our existing tooling by focusing on integrations with communications tools such as Slack, Zoom, etc., so we can accelerate time-to-market and iterate as we go, while also focusing on building out new functionality.\n\nThe infrastructure team on GitLab.com is [dogfooding the incident management system](https://gitlab.com/groups/gitlab-org/-/epics/1672) so we can put the tooling through its paces, making improvements as we go.\n\n## Outside the loop: Getting GitLab administrators to monitor GitLab using GitLab\n\nKenny says the product team has a strategy for creating more exposure to the monitoring capabilities GitLab has in development: putting our monitoring capabilities front and center for administrators of the GitLab self-managed instance.\n\n“Today you can create a project for your application that's an e-commerce app, and get the instrumentation to know whether the Kubernetes cluster is experiencing pain, whether SLOs that you custom define have alerts and respond to that with incidents,” says Kenny. “We'd like you to have that exact same experience, or expose you to that same experience with your GitLab self-managed instance, so that as an administrator you're using the same tools to monitor and respond to the GitLab instance as your developers would use to monitor and respond to their applications.”\n\nBy essentially setting up administrators to dogfood the monitoring features we are providing to developers for application management, we can ensure that they're battle-tested on a larger application.\n\n## The core challenge of the observability suite\n\nWhile the product team at GitLab has a vision and roadmap for building a comprehensive suite of observability instrumentation, there isn’t a clear consensus among monitoring experts as to what is required for a robust observability suite in this new, cloud native world.\n\n“There's varied opinion in the new world that's Kubernetes-based about what an observability system looks like,” says Kenny. “There's a legacy view that seems to be evolving. So, we need to keep up with that and with the industry's evolution of what we consider required. 
We as a company just need to stay focused on what our users are asking for, and that's why I think completing that DevOps loop is important first, because then we'll start getting immediate user feedback.”\n\nKeep an eye out for these new monitoring updates in our 12.2 and 12.3 releases.\n\nCover photo by Glen . on [Unsplash](https://unsplash.com/search/photos/binoculars?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).\n{: .note}\n",[710,9],{"slug":1452,"featured":6,"template":688},"monitoring-team-update","content:en-us:blog:monitoring-team-update.yml","Monitoring Team Update","en-us/blog/monitoring-team-update.yml","en-us/blog/monitoring-team-update",{"_path":1458,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1459,"content":1465,"config":1470,"_id":1472,"_type":13,"title":1473,"_source":15,"_file":1474,"_stem":1475,"_extension":18},"/en-us/blog/new-machine-types-for-gitlab-saas-runners",{"title":1460,"description":1461,"ogTitle":1460,"ogDescription":1461,"noIndex":6,"ogImage":1462,"ogUrl":1463,"ogSiteName":672,"ogType":673,"canonicalUrls":1463,"schema":1464},"GitLab introduces new machine types for GitLab SaaS Linux Runners","GitLab SaaS now offers larger machine types for running CI jobs on Linux.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749672836/Blog/Hero%20Images/multiple-machine-types-cover.png","https://about.gitlab.com/blog/new-machine-types-for-gitlab-saas-runners","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"GitLab introduces new machine types for GitLab SaaS Linux Runners\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Darren Eastman\"}],\n        \"datePublished\": \"2022-09-22\",\n      }",{"title":1460,"description":1461,"authors":1466,"heroImage":1462,"date":1467,"body":1468,"category":781,"tags":1469},[1122],"2022-09-22","\nOur GitLab SaaS vision is to provide a solution where you can easily choose and use the correct type of public cloud-hosted compute resources for your CI/CD jobs. In this first iteration towards achieving that vision, we are pleased to announce that two larger compute machines are generally available for GitLab SaaS Runners on Linux.\n\nWith these two machine types, you now have more choices for your GitLab SaaS CI/CD jobs. With 100% job isolation on an ephemeral virtual machine, and security and autoscaling fully managed by GitLab, you can confidently run your critical [CI/CD](/topics/ci-cd/) jobs on GitLab SaaS.\n\n## New machine type details\n\nThe new [SaaS Runners on Linux](https://docs.gitlab.com/ee/ci/runners/saas/linux_saas_runner.html) come in two machine types: 2 vCPU with 8GB RAM (`saas-linux-medium-amd64`) and 4 vCPU with 16GB RAM (`saas-linux-large-amd64`). These machine types, powered by the latest generation of Google Compute N2D virtual machines, deliver significant performance improvements for general-purpose CI workloads. The medium machine type, `saas-linux-medium-amd64`, is available to all subscriptions (Free, Premium, Ultimate). The large machine type, `saas-linux-large-amd64`, is only available to paid plans (Premium and Ultimate) and GitLab for Open Source program members.\n\nNote: If you are on the Free plan and tag a CI job with the large machine type, `saas-linux-large-amd64`, you will get an error at the job level and the job will not run.\n\n```\nThis job is stuck because of one of the following problems. 
There are no active runners online, no runners for the protected branch, or no runners that match all of the job's tags: saas-linux-large-amd64\n```\n\n## Are the new machine types right for my CI job?\n\nThe answer is that it depends. If your CI job is compute-intensive, you will likely see a performance improvement, measured by reduced build times. We ran a series of [Linux kernel](https://gitlab.com/gitlab-org/ci-cd/gitlab-runner-stress/linux-kernel) builds on the medium machine type to test the potential performance gains for compute-intensive CI jobs.\n\n![Linux kernel build CI job execution time benchmark](https://about.gitlab.com/images/blogimages/new-machine-types-gitlab-saas-linux/linux-kernel-build-runner-saas-benchmark_2022-09-22.png)\n\nOur testing found an average 41% reduction in CI job execution time for the medium machine type compared to the baseline small machine type. We recommend you experiment with the new machine types for your CI jobs to determine the right choice based on your build workflows.\n\n## Getting started\n\nTo get started with the new machine types, simply add a tag to your CI file. Without the tag, a job in your pipeline will automatically run on the small machine type.\n\n### Example pipeline configuration\n\nIn this example pipeline configuration, `job_001` will run on the default Linux SaaS Runner as no machine type tag is defined. The subsequent job, `job_002`, in the build stage will run on the medium machine type, and `job_003` will run on the large machine type. So you have flexibility within a GitLab CI/CD pipeline to choose the right machine type for each job.\n\n```yaml\nstages:\n  - Prebuild\n  - Build\n  - Unit Test\n\njob_001:\n  stage: Prebuild\n  script:\n    - echo \"this job runs on the default (small) machine type\"\n\njob_002:\n  tags: [ saas-linux-medium-amd64 ]\n  stage: Build\n  script:\n    - echo \"this job runs on the medium machine type\"\n\njob_003:\n  tags: [ saas-linux-large-amd64 ]\n  stage: Unit Test\n  script:\n    - echo \"this job runs on the large machine type\"\n```\n\n## Understanding the new machine types and cost factors\n\nYou can start using the new machine types now with the CI minutes currently available in your plan. The new machine types will consume your CI minutes at a different rate than the default (small) machine type, based on an applied cost factor. If you are a GitLab for Open Source program member, refer to the [cost factor documentation page](https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html#cost-factor) for details on how cost factors are applied to your CI/CD jobs.\n\n|  | saas-linux-small-amd64 | saas-linux-medium-amd64 | saas-linux-large-amd64 |\n| ------ | ------ | ------ | ------ |\n| CI minutes consumed per 1 minute of build time | 1 | 2 | 3 |\n\nFor example, a job with 10 minutes of build time consumes 10 CI minutes on the small machine type, 20 CI minutes on medium, and 30 CI minutes on large.\n\nToday, your CI minutes usage report on GitLab SaaS is an aggregate of all of the CI minutes consumed across all the machine types you select in your jobs. In this [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/356076), we are working towards adding visibility into usage by each Runner type, so you will soon have more granular reporting of use across the various Runner classes (Linux, Windows, macOS) and machine types we plan to offer.\n\n## Feedback\n\nAt GitLab, we value your input and use it as a critical sensing mechanism in planning roadmap investments. 
To provide feedback on the machine types you need on GitLab SaaS Runners on Linux, add a comment to the respective comment thread in this [issue](https://gitlab.com/gitlab-org/gitlab/-/issues/373196).\n\nCover image by [Julian Hochgesang](https://unsplash.com/@julianhochgesang) on [Unsplash](https://unsplash.com)\n{: .note}\n",[683,684,9,781],{"slug":1471,"featured":6,"template":688},"new-machine-types-for-gitlab-saas-runners","content:en-us:blog:new-machine-types-for-gitlab-saas-runners.yml","New Machine Types For Gitlab Saas Runners","en-us/blog/new-machine-types-for-gitlab-saas-runners.yml","en-us/blog/new-machine-types-for-gitlab-saas-runners",{"_path":1477,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1478,"content":1483,"config":1489,"_id":1491,"_type":13,"title":1492,"_source":15,"_file":1493,"_stem":1494,"_extension":18},"/en-us/blog/observability-vs-monitoring-in-devops",{"title":1479,"description":1480,"ogTitle":1479,"ogDescription":1480,"noIndex":6,"ogImage":1442,"ogUrl":1481,"ogSiteName":672,"ogType":673,"canonicalUrls":1481,"schema":1482},"Observability vs. monitoring in DevOps","Want to gain true and actionable visibility across your software development lifecycle? Observability is the answer.","https://about.gitlab.com/blog/observability-vs-monitoring-in-devops","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Observability vs. monitoring in DevOps\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Mike Vanbuskirk\"}],\n        \"datePublished\": \"2022-06-14\",\n      }",{"title":1479,"description":1480,"authors":1484,"heroImage":1442,"date":1486,"body":1487,"category":681,"tags":1488},[1485],"Mike Vanbuskirk","2022-06-14","\nIn almost any modern software infrastructure, there is inevitably some form of monitoring or logging. The launch of syslog for Unix systems in the 1980s established both the value of being able to audit and understand what is going on inside a system and the architectural importance of separating out that mechanism.\n\nHowever, despite the value and importance of this visibility into system behavior, monitoring and logging are too often treated as an afterthought. There are countless instances of systems emitting logs into a void, never aggregated or analyzed for critical information, or of infrastructure where legacy monitoring systems were installed a decade ago and never updated to modern standards.\n\nRecently, shifts in the operational landscape have given rise to the concept of observability. Rather than expecting engineers to form their own assumptions about how their application is performing from static measurements, observability enables them to see a holistic picture of their application behavior and, critically, how a user perceives performance.\n\n> You’re invited! Join us on June 23rd for the [GitLab 15 launch event](https://page.gitlab.com/fifteen) with DevOps guru Gene Kim and several GitLab leaders. They’ll show you what they see for the future of DevOps and The One DevOps Platform.\n\n## What is observability?\nTo understand the value in observability, it's helpful to first establish an understanding of what monitoring is, as well as what it does and does not provide in terms of information and context.\n\nAt its core, monitoring is presenting the results of measurements of different values and outputs of a given system or software stack. 
Common metrics for measurement are things like CPU usage, RAM usage, and response time or latency. Classic logging systems are similar: they record static pieces of information about events that occurred during system operation.\n\nMonitoring provides limited-context measurements that might indicate a larger issue with the system. Aggregation and correlation are possible using traditional monitoring tools, but typically require manual configuration and tuning to provide a holistic view. As the industry has advanced, the concept of what makes for effective monitoring has moved beyond static measurements of things like CPU usage. In its now-famous SRE book, Google emphasizes that you should focus on four key metrics, known as \"[Golden Signals](https://sre.google/sre-book/monitoring-distributed-systems/)\":\n\n- Latency: The time it takes to fulfill a request\n- Traffic: High-level measurement of overall demand\n- Errors: The rate at which requests fail\n- Saturation: Measurement of resource usage as a fraction of the whole; typically focuses on constrained resources\n\nWhile these metrics help home in on a better picture of overall system performance, they still require a non-trivial engineering investment to design, build, integrate, and configure a complete monitoring system. There is considerable effort involved in enumerating failure modes, and manually defining and associating the correct correlations in even simple cases can be time-consuming.\n\nIn contrast, observability offers a much more intuitive and complete picture as a first-class feature: You don’t need to manually correlate disparate monitoring tooling. An aggregated monitoring dashboard is only as good as the last engineer who built it; conversely, an observability platform adapts itself to present critical information in the right context, automatically. This can even extend further left into the software development lifecycle (SDLC), with observability tooling providing important performance feedback during CI/CD runs, giving developers operational feedback about their code.\n\nUltimately, observability provides more holistic debugging and understanding. Observability data can show the “unknown unknowns” to better understand production incidents. For more context on why that's important, the next section highlights an example where monitoring might fall short and where observability fills in the crucial story.\n\n## Why focus on observability?\nFocusing on observability can help drive down mean time to resolution (MTTR), resulting in shorter outages, better application performance, and improved customer experience. While it may seem at first glance that monitoring can provide the same advantages, consider the anecdote that follows.\n\nAn engineering organization gets a ping from the accounting department; the invoice for cloud services is getting expensive, so much so that the CFO has noticed. DevOps engineers have pored over the monitoring system to no avail; every part of the system has consistently reported being in the green for things like memory, CPU, and disk I/O. As it turns out, the root cause was another \"unknown unknown\" event: DNS latency in the CI/CD pipelines was causing builds to fail at an elevated rate. Builds needing more retries consumed a significant amount of cloud resources. However, this effect never persisted long enough to reflect in the monitoring system. 
By adding observability tooling and collecting all event types in the environment, ops was able to zero in on the source of the problem and remediate it. In a traditional monitoring system, the organization would have had to know about the DNS latency problem a priori.\n\nObservability is also important for non-technical stakeholders and business units. As technology becomes more intertwined with the primary profit silo, software infrastructure KPIs become business KPIs. Observability can provide better insight into KPI performance, as well as self-service options for different teams.\n\nModern software and applications depend heavily on providing good user experience (UX). As the previous story illustrates, monitoring static metrics won't always tell the complete story about UX or system performance. There might be serious issues lurking behind seemingly healthy metric dashboards.\n\n## Key observability metrics\nFor organizations that have decided to implement observability tooling, the next step is to identify the core goals of observability, and how those can best be implemented across their stack.\n\nAn excellent place to start is with the three fundamental pillars of observability:\n- Logs: Records of information and events\n- Metrics: Measurements of specific values and performance data\n- Tracing: End-to-end request performance recorded during runtime\n\nAlthough this can seem overwhelming, projects like [OpenTelemetry](https://opentelemetry.io/) are helping to drive broad standards acceptance for logging, metrics, and tracing, enabling a more consistent ecosystem and a shorter time-to-value for organizations that implement observability with tooling built on OpenTelemetry standards.\n\nAdditional observability data and pillars include:\n- Error tracking: More granular logs with aggregation\n- Continuous Profiling: Evaluating granular code performance\n- Real User Monitoring (RUM): Understanding application performance from the perspective of an actual user\n\nLooking at these pillars, a central theme starts to emerge: it's no longer enough to look at a small slice of time and space in modern distributed systems; a holistic, 10,000-foot view is needed. Understanding application performance starts with sampling it as an actual customer experiences it, and then further monitoring the complete performance and behavior of their interaction with your software.\n\nBeyond traditional application monitoring, observability can help improve the operational excellence posture for any engineering organization. Well-crafted alerts and incident management programs are usually born out of hard lessons from real outages. Implementing [chaos engineering](https://principlesofchaos.org/) can test observability platforms during real failures, albeit in a controlled environment with known outcomes. Introducing chaos engineering into systems where \"unknown unknowns\" might hide, not just in your production workloads but also in your CI/CD pipelines, supply chain, and DNS, can yield significant gains in operational footing.\n\n## Observability is a critical part of DevOps\nNot only is observability critical for DevOps, but also for the entire organization. Replacing the static data of legacy monitoring solutions, [observability](/direction/monitor/platform-insights/) provides a full-spectrum view of application infrastructure.\n\nDevOps teams should be working with stakeholders to share observability metrics in a way that benefits the entire organization, as well as take steps to improve the implementation. 
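\n\nTo make the instrumentation side concrete, here is a minimal OpenTelemetry Collector configuration sketch (the exporter endpoint is a placeholder, not a real backend) that receives telemetry over OTLP and forwards traces to an observability backend:\n\n```yaml\nreceivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\n\nprocessors:\n  batch:   # batches telemetry to reduce outbound requests\n\nexporters:\n  otlphttp:\n    endpoint: https://observability.example.com/otlp   # placeholder backend URL\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n      processors: [batch]\n      exporters: [otlphttp]\n```\n\n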
Learning, and then evangelizing, the benefits of app instrumentation to development teams can make observability even more effective. DevOps teams can also help identify the root cause of production incidents faster; well-instrumented application code makes it easy to distinguish application issues from infrastructure issues. Finally, shifting observability left along the CI/CD pipeline means potential service-level objective (SLO) deltas are caught before they reach production.\n\nDevOps teams looking to provide meaningful improvements to application performance and business outcomes can look to observability as a way to deliver both.\n\n**Watch now: Senior Developer Evangelist Michael Friedrich digs deeper into the shift from monitoring to observability:**\n\n\u003C!-- blank line -->\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube.com/embed/BkREMg8adaI\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\u003C!-- blank line -->\n",[707,925,9],{"slug":1490,"featured":6,"template":688},"observability-vs-monitoring-in-devops","content:en-us:blog:observability-vs-monitoring-in-devops.yml","Observability Vs Monitoring In Devops","en-us/blog/observability-vs-monitoring-in-devops.yml","en-us/blog/observability-vs-monitoring-in-devops",{"_path":1496,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1497,"content":1503,"config":1510,"_id":1512,"_type":13,"title":1513,"_source":15,"_file":1514,"_stem":1515,"_extension":18},"/en-us/blog/optimize-gitops-workflow",{"title":1498,"description":1499,"ogTitle":1498,"ogDescription":1499,"noIndex":6,"ogImage":1500,"ogUrl":1501,"ogSiteName":672,"ogType":673,"canonicalUrls":1501,"schema":1502},"Optimize GitOps workflow with version control from GitLab","A GitOps workflow improves development, operations and business processes and GitLab’s CI plays a vital role.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749673081/Blog/Hero%20Images/gitops-image-unsplash.jpg","https://about.gitlab.com/blog/optimize-gitops-workflow","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Optimize GitOps workflow with version control from GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Brein Matturro\"}],\n        \"datePublished\": \"2019-10-28\",\n      }",{"title":1498,"description":1499,"authors":1504,"heroImage":1500,"date":1505,"body":1506,"category":1507,"tags":1508},[703],"2019-10-28","\nGitOps is a way for IT operations to manage changes across infrastructure and development teams. At GitLab Connect in Denver, [Tyler Sparks](https://www.linkedin.com/in/sparksconcept/), principal engineer and owner of Sparks Concept, presented a talk on why GitOps is a productive workflow and how using GitLab can increase communication and version control.\n\n[GitOps](/topics/gitops/) uses infrastructure as code but with processes in place on top of it, including extensive use of merge requests for everything from policy to infrastructure changes. “Success for most companies and engineering groups is based on the interactions of a large, complex, distributed system,” Tyler says. The goal of GitOps is to incorporate Git beyond development and operations teams, improving the business as a whole with the right tool. “It's a really cool way that GitLab integrates and it's a way to shift things left in your organization.”\n\n## The Git in GitOps\n\n“Git is the single source of truth. 
You shouldn’t be able to make any change outside of Git,” Tyler says. This creates one clean transaction between teams. Git establishes a unified location for everything from security, infrastructure changes, deployments, and process changes to the integration of other tools. “Git is serving as the glue to make these safe transitions so that you can move faster as a team,” Tyler says.\n\nCreating that interaction between groups is often elaborate and difficult to manage. “Anyone building software these days is finding it more and more complex...everything is changing, the landscape is constantly changing,” Tyler says. Services are being run on stacks upon stacks, and there is a lot of risk involved in maintenance. A tool like [GitLab CI](/solutions/continuous-integration/) simplifies these processes and grants visibility.\n\n## GitOps best practices\n\nIn a GitOps workflow, where one simple change can impact three different teams, strong [version control is imperative for communication](/topics/version-control/). With disparate tools and poorly defined handoffs, the solution is to move to one repository for all tools and teams. With one overarching repository, “You can have a bunch of parallel workstreams running safely… you will have minimum viable change and a way to observe it,” Tyler says.\n\nWith GitLab’s version control system in place, teams can see what’s going on, to work together and to know what change is going to impact where. “GitLab CI is one of the original products that made it possible to start to take an integrative view of the system,” Tyler says. “This is the penultimate way to [promote collaboration](/topics/gitops/gitops-gitlab-collaboration/) and to break down silos within an organization. GitLab is a tool that helps with that.”\n\nGitLab’s version control not only safeguards the infrastructure, but ultimately trickles throughout the entire enterprise. “As companies adopt GitLab, they’re not just more successful with their technology...it really comes down to how they’re functioning as a group,” Tyler says. “GitLab encourages some really good practices around development and how teams interact.”\n\n>“That’s why GitLab is the clear winner...They’re not just leading Gartner and Forrester because they paid somebody off. 
They’re actually an amazing tool.” - Tyler Sparks, principal engineer and owner of Sparks Concept\n\nLearn more about GitOps best practices and Tyler’s work with GitLab CI in his presentation below:\n\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube.com/embed/5ykRuaZvY-E\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\nCover image by [David Rangel](https://unsplash.com/@rangel) on [Unsplash](https://unsplash.com)\n{: .note}\n","open-source",[757,9,230,1509],"user stories",{"slug":1511,"featured":6,"template":688},"optimize-gitops-workflow","content:en-us:blog:optimize-gitops-workflow.yml","Optimize Gitops Workflow","en-us/blog/optimize-gitops-workflow.yml","en-us/blog/optimize-gitops-workflow",{"_path":1517,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1518,"content":1525,"config":1531,"_id":1534,"_type":13,"title":1535,"_source":15,"_file":1536,"_stem":1537,"_extension":18},"/en-us/blog/overcome-ai-sprawl-with-a-value-stream-management-approach",{"title":1519,"description":1520,"ogTitle":1519,"ogDescription":1520,"config":1521,"ogImage":1522,"ogUrl":1523,"ogSiteName":672,"ogType":673,"canonicalUrls":1523,"schema":1524},"Overcome AI sprawl with a Value Stream Management approach","From The Source: Learn how an AI strategy based on Value Stream Management can stop AI sprawl and supply chain constraints and drive ROI.",{"noIndex":90},"https://res.cloudinary.com/about-gitlab-com/image/upload/v1749665000/Blog/Hero%20Images/display-the-source-article-overcome-ai-sprawl-image-0492-1800x945-fy25.png","https://about.gitlab.com/blog/overcome-ai-sprawl-with-a-value-stream-management-approach","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Overcome AI sprawl with a Value Stream Management approach\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Stephen Walters\"}],\n        \"datePublished\": \"2025-01-06\",\n      }",{"title":1519,"description":1520,"authors":1526,"heroImage":1522,"date":1528,"body":1529,"category":730,"tags":1530},[1527],"Stephen Walters","2025-01-06","This is a cross-over post about [overcoming AI sprawl with a Value Stream Management approach](https://about.gitlab.com/the-source/ai/overcome-ai-sprawl-with-a-value-stream-management-approach/).",[758,732,9],{"slug":1532,"featured":6,"template":688,"externalUrl":1533},"overcome-ai-sprawl-with-a-value-stream-management-approach","https://about.gitlab.com/the-source/ai/overcome-ai-sprawl-with-a-value-stream-management-approach/","content:en-us:blog:overcome-ai-sprawl-with-a-value-stream-management-approach.yml","Overcome Ai Sprawl With A Value Stream Management Approach","en-us/blog/overcome-ai-sprawl-with-a-value-stream-management-approach.yml","en-us/blog/overcome-ai-sprawl-with-a-value-stream-management-approach",{"_path":1539,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1540,"content":1546,"config":1551,"_id":1553,"_type":13,"title":1554,"_source":15,"_file":1555,"_stem":1556,"_extension":18},"/en-us/blog/partial-clone-for-massive-repositories",{"title":1541,"description":1542,"ogTitle":1541,"ogDescription":1542,"noIndex":6,"ogImage":1543,"ogUrl":1544,"ogSiteName":672,"ogType":673,"canonicalUrls":1544,"schema":1545},"How Git Partial Clone lets you fetch only the large file you need","Work faster with this experimental Partial Clone feature for huge Git repositories, saving you time, bandwidth, and storage, one large file at a 
time.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749681131/Blog/Hero%20Images/partial-clone-for-massive-repositories.jpg","https://about.gitlab.com/blog/partial-clone-for-massive-repositories","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How Git Partial Clone lets you fetch only the large file you need\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"James Ramsay\"}],\n        \"datePublished\": \"2020-03-13\",\n      }",{"title":1541,"description":1542,"authors":1547,"heroImage":1543,"date":1548,"body":1549,"category":1507,"tags":1550},[900],"2020-03-13","\n\nThe Git project began nearly 15 years ago, on [April 7,\n2005](https://marc.info/?l=linux-kernel&m=111288700902396), and is now the\n[version control system](/topics/version-control/) of choice for developers. Yet, there are certain types of projects that\noften do not use Git, particularly projects that have many large binary files,\nsuch as video games. One reason projects with large binary files don't use Git\nis because, when a Git repository is cloned, Git will download every version of\nevery file in the repo. For most use cases, downloading this history is a\nuseful feature, but it slows cloning and fetching for projects with large binary\nfiles, assuming the project even fits on your computer.\n\n## What is Partial Clone?\n\nPartial Clone is a new feature of Git that replaces [Git\nLFS](https://git-lfs.github.com/) and makes working with very large repositories\nbetter by teaching Git how to work without downloading every file. Partial Clone\nhas been\n[years](https://public-inbox.org/git/xmqqeg4o27zw.fsf@gitster.mtv.corp.google.com/)\nin the making, with code contributions from GitLab, GitHub, Microsoft and\nGoogle. Today it is experimentally available in Git and GitLab, and can be\nenabled by administrators\n([docs](https://docs.gitlab.com/ee/topics/git/partial_clone.html)).\n\nPartial Clone speeds up fetching and cloning because less data is\ntransferred, and reduces disk usage on your local computer. For example, cloning\n[`gitlab-com/www-gitlab-com`](https://gitlab.com/gitlab-com/www-gitlab-com)\nusing Partial Clone (`--filter=blob:none`) is at least 50% faster, and transfers\n70% less data.\n\nNote: Partial Clone is one specific performance optimization for very large\nrepositories. [Sparse\nCheckout](https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/)\nis a related optimization that is particularly focused on repositories with\ntremendously large numbers of files and revisions such as\n[Windows](https://devblogs.microsoft.com/bharry/the-largest-git-repo-on-the-planet/)\ncode base.\n\n## A brief history of large files\n\n\"What about Git LFS?\" you may ask. Doesn't LFS stand for \"large file storage\"?\n\nPreviously, extra tools were required to store large files in Git. In 2010,\n[git-annex](https://git-annex.branchable.com/) was released, and five years\nlater in 2015, [Git LFS](https://git-lfs.github.com/) was released. Both\ngit-annex and Git LFS added large file support to Git in a similar way: Instead\nof storing a large file in Git, store a pointer file that links to the large\nfile. Then, when someone needs a large file, they can download it on-demand\nusing the pointer.\n\nThe criticism of this approach is that there are now two places to store files,\nin Git or in Git LFS. 
This means that everyone must remember that big files need\nto go in Git LFS to keep the Git repo small and fast. There are downsides to\nthis approach. Besides being susceptible to human error, the pointer encodes\ndecisions based on bandwidth and file type into the structure of the repository\nthat affect everyone using the repository. Our assumptions about\nbandwidth and storage are likely to change over time, and vary by location,\nbut decisions encoded in the repository are not flexible. Administrators and\ndevelopers alike benefit from flexibility in where to store large files, and\nwhich files to download.\n\nPartial Clone solves these problems by removing the need for two classes of\nstorage, and special pointers. Let's walk through an example to understand how.\n\n## How to get started with Partial Clone\n\nLet's continue to use `gitlab-com/www-gitlab-com` as an example project, since\nit has quite a lot of images. For a larger repository, like a video game with\ndetailed textures and models that could take up a lot of disk space, the benefits will be even more significant.\n\nInstead of a vanilla `git clone`, we will include a filter spec, which controls\nwhat is excluded when fetching data. In this situation, we just want to exclude\nlarge binary files. I've included `--no-checkout` so we can more clearly observe\nwhat is happening.\n\n```bash\ngit clone --filter=blob:none --no-checkout git@gitlab.com:gitlab-com/www-gitlab-com.git\n# Cloning into 'www-gitlab-com'...\n# remote: Enumerating objects: 624541, done.\n# remote: Counting objects: 100% (624541/624541), done.\n# remote: Compressing objects: 100% (151886/151886), done.\n# remote: Total 624541 (delta 432983), reused 622339 (delta 430843), pack-reused 0\n# Receiving objects: 100% (624541/624541), 74.61 MiB | 8.14 MiB/s, done.\n# Resolving deltas: 100% (432983/432983), done.\n```\n\nAbove we explicitly told Git not to check out the default branch. Normally\n`checkout` doesn't require fetching any data from the server, because we have\neverything locally. 
When using Partial Clone, since we are deliberately not downloading everything, Git will need to fetch any missing files when doing a\ncheckout.\n\n```bash\ngit checkout master\n# remote: Enumerating objects: 12080, done.\n# remote: Counting objects: 100% (12080/12080), done.\n# remote: Compressing objects: 100% (11640/11640), done.\n# remote: Total 12080 (delta 442), reused 9773 (delta 409), pack-reused 0\n# Receiving objects: 100% (12080/12080), 1.10 GiB | 8.49 MiB/s, done.\n# Resolving deltas: 100% (442/442), done.\n# Updating files: 100% (12342/12342), done.\n# Filtering content: 100% (3/3), 131.24 MiB | 4.73 MiB/s, done.\n```\n\nIf we check out a different branch or commit, we'll need to download more missing\nfiles.\n\n```bash\ngit checkout 92d1f39b60f957d0bc3c5621bb3e17a3984bdf72\n# remote: Enumerating objects: 1968, done.\n# remote: Counting objects: 100% (1968/1968), done.\n# remote: Compressing objects: 100% (1953/1953), done.\n# remote: Total 1968 (delta 23), reused 1623 (delta 15), pack-reused 0\n# Receiving objects: 100% (1968/1968), 327.44 MiB | 8.83 MiB/s, done.\n# Resolving deltas: 100% (23/23), done.\n# Updating files: 100% (2255/2255), done.\n# Note: switching to '92d1f39b60f957d0bc3c5621bb3e17a3984bdf72'.\n```\n\nGit remembers the filter spec we provided when cloning the repository so that\nfetching updates will also exclude large files until we need them.\n\n```bash\ngit config remote.origin.promisor\n# true\n\ngit config remote.origin.partialclonefilter\n# blob:none\n```\n\nWhen committing changes, you simply commit binary files like you would any other\nfile. There is no extra tool to install or configure, no need to treat big files\ndifferently to small files.\n\n## Network and Storage\n\nIf you are already using [Git LFS](https://git-lfs.github.com/) today, you might\nbe aware that large files are stored and transferred differently to regular Git\nobjects. On GitLab.com, Git LFS objects are stored in object storage (like AWS\nS3) rather than fast attached storage (like SSD), and transferred over HTTP even\nwhen using SSH for regular Git objects. Using object storage has the advantage\nof reducing storage costs for large binary files, while using simpler HTTP\nrequests for large downloads allows the possibility of resumable and parallel\ndownloads.\n\nPartial Clone\n[already](https://public-inbox.org/git/20190625134039.21707-1-chriscool@tuxfamily.org/)\nsupports more than one remote, and work is underway to allow large files to be\nstored in a different location such as object storage. Unlike Git LFS, however,\nthe repository or instance administrator will be able to choose which objects\nshould be stored where, and change this configuration over time if needed.\n\nFollow the epic for [improved large file\nstorage](https://gitlab.com/groups/gitlab-org/-/epics/1487) to learn more and\nfollow our progress.\n\n## Performance\n\nWhen fetching new objects from the Git server using a [filter\nspec](https://github.com/git/git/blob/v2.25.0/Documentation/rev-list-options.txt#L735)\nto exclude objects from the response, Git will check each object and exclude\nany that match the filter spec. 
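\n\nTo get a feel for what a given filter spec matches, you can run the same filters locally with [`git-rev-list(1)`](https://git-scm.com/docs/git-rev-list), which accepts the same `--filter` option. A quick sketch (the output depends on your repository):\n\n```bash\n# List objects reachable from HEAD, excluding all blobs\ngit rev-list --objects --filter=blob:none HEAD | head\n\n# Exclude only blobs larger than 1 MiB\ngit rev-list --objects --filter=blob:limit=1m HEAD | head\n```\n\n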
In [Git\n2.25](https://raw.githubusercontent.com/git/git/master/Documentation/RelNotes/2.25.0.txt),\nthe most recent version, filtering has not been optimized for performance.\n\n[Jeff King (Peff)](https://github.com/peff/) (GitHub) recently\n[contributed](https://public-inbox.org/git/20200214182147.GA654525@coredump.intra.peff.net/)\nperformance improvements for blob size filtering, which will likely be included\nin [Git 2.26](https://gitlab.com/gitlab-org/gitaly/issues/2497), and our plan is\nto include it in the GitLab 12.10 release.\n\nOptimizing the sparse filter spec option (`--filter=sparse:oid`), which filters\nbased on file path, is more complex because blobs, which contain the file\ncontent, do not include file path information. The directory structure of a\nrepository is stored in tree objects.\n\nFollow the epic for [Partial Clone performance\nimprovements](https://gitlab.com/groups/gitlab-org/-/epics/1671) to learn more\nand follow our progress.\n\n## Usability\n\nOne of the drawbacks of Git LFS was that it required installing an additional\ntool. In comparison, Partial Clone does not require any additional tools.\nHowever, it does require learning new options and configurations, such as\ncloning with the `--filter` option.\n\nWe want to make it easy for people to get their work done; many people simply\nwant Git to just work. They shouldn't need to work out the optimal blob size\nfilter spec for a project, or even what a filter spec is. While Partial Clone remains\nexperimental, we haven't made any changes to the GitLab interface to highlight\nPartial Clone, but we are investigating this and welcome your feedback. Please\njoin the conversation on this\n[issue](https://gitlab.com/gitlab-org/gitlab/issues/207744).\n\n## File locking and tool integrations\n\nAny conversation about large binary files, particularly in regard to video\ngames, is incomplete without discussing file locking and tool integrations.\n\nUnlike plain text [source code](/solutions/source-code-management/), resolving conflicts between different versions of\na binary file is often impossible. To prevent conflicts in binary file editing,\nan exclusive file lock is used, meaning only one person at a time can edit a\nsingle file, regardless of branches. If conflicts can't be resolved, allowing multiple\nversions of an individual file to be created in parallel on different branches is a bug, not\na feature. GitLab already has basic file locking support, but it is really only\nuseful for plain text because it only applies to the default branch, and is not\nintegrated with any local tools.\n\nLocal tooling integrations are important for binary asset workflows, to\nautomatically propagate file locks to the local development environment, and to\nallow artists to work on assets without needing to use Git from the command\nline. Propagating file locks quickly to local development environments is also\nimportant because it prevents work from being wasted before it even happens.\n\nFollow the [file locking](https://gitlab.com/groups/gitlab-org/-/epics/1488) and\n[integrations](https://gitlab.com/groups/gitlab-org/-/epics/2704) epics for more\ninformation about what we're working on.\n\n## Conclusion\n\nLarge files are necessary for many projects, and Git will soon support this\nnatively, without the need for extra tools. 
Although Partial Clone is still an\nexperimental feature, we are making improvements with every release and the\nfeature is now ready for testing.\n\nThank you to the Git community for your work over the past years on improving\nsupport for enormous repositories. Particularly, thank you to [Jeff\nKing](https://github.com/peff/) (GitHub) and [Christian\nCouder](https://about.gitlab.com/company/team/#chriscool) (senior backend\nengineer on Gitaly at GitLab) for your early experimentation with Partial Clone,\nJonathan Tan (Google) and [Jeff Hostetler](https://github.com/jeffhostetler)\n(Microsoft) for contributing the [first\nimplementation](https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/)\nof Partial Clone and promisor remotes, and the many others who've also\ncontributed.\n\nIf you are already using Partial Clone, or would like to help us test Partial\nClone on a large project, please get in touch with me, [James\nRamsay](https://about.gitlab.com/company/team/#jramsay) (group manager, product\nfor Create at GitLab), [Jordi\nMon](https://about.gitlab.com/company/team/#jordi_mon) (senior product marketing\nmanager for Dev at GitLab), or your account manager.\n\nFor more information on Partial Clone, check out [the documentation](https://docs.gitlab.com/ee/topics/git/partial_clone.html).\n\nCover image by [Simon Boxus](https://unsplash.com/@simonlerouge) on\n[Unsplash](https://unsplash.com/photos/4ftI4lCcByM)\n{: .note}\n",[757,9],{"slug":1552,"featured":6,"template":688},"partial-clone-for-massive-repositories","content:en-us:blog:partial-clone-for-massive-repositories.yml","Partial Clone For Massive Repositories","en-us/blog/partial-clone-for-massive-repositories.yml","en-us/blog/partial-clone-for-massive-repositories",{"_path":1558,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1559,"content":1565,"config":1571,"_id":1573,"_type":13,"title":1574,"_source":15,"_file":1575,"_stem":1576,"_extension":18},"/en-us/blog/puma-nakayoshi-fork-and-compaction",{"title":1560,"description":1561,"ogTitle":1560,"ogDescription":1561,"noIndex":6,"ogImage":1562,"ogUrl":1563,"ogSiteName":672,"ogType":673,"canonicalUrls":1563,"schema":1564},"Ruby 2.7: Understand and debug problems with heap compaction","An overview of Ruby 2.7 heap compaction and the risks it adds to production Rails applications.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749669673/Blog/Hero%20Images/engineering.png","https://about.gitlab.com/blog/puma-nakayoshi-fork-and-compaction","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Ruby 2.7: Understand and debug problems with heap compaction\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Matthias Käppler\"}],\n        \"datePublished\": \"2021-04-28\",\n      }",{"title":1560,"description":1561,"authors":1566,"heroImage":1562,"date":1568,"body":1569,"category":681,"tags":1570},[1567],"Matthias Käppler","2021-04-28","\n\nThe GitLab Rails application runs on [Puma](https://puma.io/), a multi-threaded Rack application server written in Ruby.\nWe recently updated Puma to major version 5, which introduced [a number of important\nchanges](https://github.com/puma/puma/blob/master/History.md#500--2020-09-17),\nincluding support for _compaction_, a technique to reduce memory fragmentation in the\nRuby heap.\n\nIn this post we will describe what Puma's \"nakayoshi fork\" does, what compaction is,\nand some of the challenges we faced when first 
deploying it.\n\n## Nakayoshi: A friendlier `fork`\n\nPuma 5 added a new configuration switch: `nakayoshi_fork`. This switch affects Puma's behavior when\nforking new workers from the primary process. It is largely based on a [Ruby gem of the same name](https://github.com/ko1/nakayoshi_fork)\nbut adds new functionality. More specifically, enabling `nakayoshi_fork` in Puma will result in two additional\nsteps prior to forking into new workers:\n\n1. **Tenuring objects.** By running several minor garbage collection cycles ahead of a `fork`, Ruby can promote survivors\n   from the young to the old generation (referred to as \"tenuring\"). These objects are often classes, modules, or long-lived\n   constants that are unlikely to change.\n   This process makes forking copy-on-write friendly because tagging an object as \"old\" implies a write\n   to the underlying heap page. Doing this prior to forking means the OS won't have\n   to copy this page from the parent to the worker process later. We won't be discussing copy-on-write in detail but\n   [this blog post offers a good introduction to the topic and how it relates to Ruby and pre-fork servers](https://brandur.org/ruby-memory).\n\n1. **Heap compaction.** Ruby 2.7 added a new method `GC.compact`, which\n   will reorganize the Ruby heap to pack objects closer together when invoked. `GC.compact` reduces Ruby heap fragmentation and\n   potentially frees up Ruby heap pages so that the physical memory consumed can be reclaimed by the OS.\n   This step only happens when `GC.compact` is available in the version of Ruby that is in use (for MRI, 2.7 or newer).\n\nIn the remainder of this post, we will look at:\n\n* How `GC.compact` works and its potential benefits.\n* Why using C-extensions can be problematic when using compaction.\n* How we resolved a production incident that crashed GitLab.\n* What to look out for before enabling compaction in your app, via `nakayoshi_fork` or otherwise.\n\n## How compacting garbage collection works\n\nThe primary goal of a compacting garbage collector (GC) is to use allocated memory more\neffectively, which increases the likelihood of the application using less memory over time.\nCompaction is especially important when processes can share memory, as is the case with Ruby pre-fork\nservers such as Puma or Unicorn. But how does Ruby accomplish this?\n\nRuby manages its own object heap by allocating chunks of memory from the operating system called pages\n(a confusing term since Ruby heap pages are distinct from the smaller memory pages managed by the OS itself).\nWhen an application asks to create a new object, Ruby will try to find a free object slot in one of these\npages and fill it. As objects are allocated and deallocated over the lifetime of the application,\nthis can lead to fragmentation, with pages being neither entirely full nor entirely empty. This is the\nprimary cause of Ruby's infamous runaway memory problem: Since the available space isn't optimally used,\npages will rarely become entirely empty \"tomb pages\", and only empty pages can be deallocated.\n\nRuby 2.7 added a new method, `GC.compact`, which aims to address this problem by walking the entire\nRuby heap space and moving objects around to obtain tightly packed pages. This process will ideally make\nsome pages unused, and unused memory can be reclaimed by the OS. 
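\n\nYou can observe the effect from the command line. A minimal sketch, assuming MRI Ruby 2.7 or newer is installed (the numbers will vary by machine and Ruby version):\n\n```shell\nruby -e '\n  objs = Array.new(500_000) { Object.new }  # churn the heap to fragment it\n  objs = nil\n  GC.start                                  # collect, leaving sparsely filled pages\n  before = GC.stat(:heap_eden_pages)\n  GC.compact                                # stop-the-world compaction\n  puts \"heap_eden_pages: #{before} -> #{GC.stat(:heap_eden_pages)}\"\n'\n```\n\n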
[Watch this video from RubyConf 2019](https://www.youtube.com/watch?v=H8iWLoarTZc) where Aaron Patterson, the author of this feature, gave a good introduction to compacting GC.\n\nCompaction is a fairly expensive task since Ruby needs to stop the world for a complete heap reorganization, so\nit's best to perform this task before forking a new worker process, which is why Puma 5 included this step when performing `nakayoshi_fork`. Moreover, running compaction before forking\ninto worker processes increases the chance of workers being able to share memory.\n\nWe were eager to enable this feature on GitLab to see if it would reduce memory consumption, but things didn't entirely go as planned.\n\n## Inside the incident\n\nAfter extensive testing via our automated performance test suite and in preproduction\nenvironments, we felt ready to explore compaction on production nodes. We kept a\n[detailed, public record of what happened\nduring this production incident](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3370), but the key details are summarized below:\n\n* The deployment passed the canary stage, meaning workers who had their heaps compacted were serving traffic\n  successfully at this point.\n* Sometime during the full fleet rollout, problems emerged: Error rates started spiking but not\n  across the entire fleet. This phenomenon is odd because errors tend to spread across all workers due to load balancing.\n* The error messages surfacing in Sentry were mysterious at best:\n  `ActionView::Template::Error\nuninitialized constant #\u003CClass:#GrapePathHelpers::DecoratedRoute:0x00007f95f10ea5b8>::UNDERSCORE`. Remember this error message for later.\n* We discovered the affected workers were segfaulting in [`hamlit`](https://github.com/k0kubun/hamlit),\n  a high-performance HAML compiler. Hamlit uses a C-extension to achieve better performance. The segfaulting and the fact\n  that we were rolling out an optimization that touches GC-internal structures were a tell-tale sign that\n  compaction was likely to be the cause.\n* We rolled back the change to quickly recover from the outage.\n\n## How we diagnosed the problem\n\nWe were disappointed by this setback and wanted to understand why the outage occurred. Fortunately,\nRuby provides detailed stack traces when crashing in C-extensions. The most effective way\nto quickly analyze these is to look for transitions where a C-extension calls into the Ruby VM\nor vice versa. These lines therefore caught our attention:\n\n```shell\n...\n/opt/gitlab/embedded/lib/libruby.so.2.7(sigsegv+0x52) [0x7f9601adb932] signal.c:946\n/lib/x86_64-linux-gnu/libc.so.6(0x7f960154c4c0) [0x7f960154c4c0]\n/opt/gitlab/embedded/lib/libruby.so.2.7(rb_id_table_lookup+0x1) [0x7f9601b15e11] id_table.c:227\n/opt/gitlab/embedded/lib/libruby.so.2.7(rb_const_lookup+0x1e) [0x7f9601b4861e] variable.c:3357\n/opt/gitlab/embedded/lib/libruby.so.2.7(rb_const_get+0x39) [0x7f9601b4a049] variable.c:2339\n# ^--- Ruby VM functions\n/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/hamlit-2.11.0/lib/hamlit/hamlit.so(str_underscore+0x16) [0x7f95ee3518f8] hamlit.c:17\n/opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/hamlit-2.11.0/lib/hamlit/hamlit.so(rb_hamlit_build_id) hamlit.c:100\n# ^-- hamlit C-extension\n...\n```\n\nThe topmost stack frame reveals that the preceding calls led to a segmentation fault (`SIGSEGV`).\nWe highlighted the lines where Hamlit calls back into Ruby: In a function called `str_underscore` which\nwas called by `rb_hamlit_build_id`. 
The `rb_*` prefix tells us that this is a C-function we can call from Ruby,\nand indeed it is used by [`Hamlit::AttributeBuilder`](https://github.com/k0kubun/hamlit/blob/master/lib/hamlit/attribute_builder.rb) to construct DOM `id`s.\n\nBut we still don't know why it is crashing. Next, we need to inspect what happens in `str_underscore`.\nWe can see that this function performs a constant lookup on `mAttributeBuilder` – searching\nfor a constant called `UNDERSCORE`. When following the breadcrumbs it turns out to simply be the string `\"_\"`.\nIt is this lookup that failed.\n\nWait -- `UNDERSCORE`? That sounds familiar. Recall the top-level error messages:\n\n```\nActionView::Template::Error\nuninitialized constant #\u003CClass:#GrapePathHelpers::DecoratedRoute:0x00007f95f10ea5b8>::UNDERSCORE\n```\n\nBut `GrapePathHelpers` is clearly not a Hamlit class. Hamlit is trying to look up its own `UNDERSCORE`\nconstant on a class in the [`grape`](https://github.com/ruby-grape/grape) gem, an entirely different library\nthat is not involved in HTML rendering at all. There is no such constant defined on Grape's\n`DecoratedRoute` class either.\n\nNow the penny dropped – remember how compaction moves around objects in Ruby's heap space? Classes in\nRuby are objects too, so `GC.compact` must have moved a Grape class into an object slot that was previously\noccupied by a Hamlit class object, but Hamlit's C-extension never saw it coming!\n\n## How we solved the problem\n\nTo be clear, what happened above should _not_ happen with a well-behaved C-extension. Compaction\nwas developed carefully with support for C-extensions that predate Ruby 2.7, so all\nexisting Ruby gems would continue to operate normally.\n\nSo what went wrong? When a C-extension allocates Ruby objects, it must _mark_ them for as long as\nthey are alive. A marked object will not be garbage collected, and because the Ruby GC cannot reason about objects\noutside of its own purview (i.e., objects referenced from C code rather than from Ruby code), it needs to rely on C-extensions\nto correctly mark and unmark objects themselves.\n\nNow comes the twist: Marked objects can be moved during compaction, and existing C-extensions\ncan't cope with an object they hold pointers to suddenly moving into a different slot.\nTherefore, Ruby 2.7 does something clever: It \"pins\" objects marked with the mark function that existed prior\nto Ruby 2.7, meaning the pinned objects are not allowed to move during compaction. 
For new code, it introduces\na special mark-but-don't-pin function that will also allow an object to move, giving gem authors the\nopportunity to make their libraries compaction-aware.\n\nHamlit does not implement compaction support, so this could only mean one thing:\nHamlit wasn't even properly marking those objects, otherwise Ruby 2.7\nwould have automatically pinned them so they wouldn't move during compaction.\nAfter [discussing an attempted fix we submitted](https://github.com/k0kubun/hamlit/pull/171), and without\na reliable way to reproduce the issue for everyone, the Hamlit author decided to sidestep the\nproblem by [resolving those constants statically instead](https://github.com/k0kubun/hamlit/pull/172)\nand marking each via `rb_gc_register_mark_object`.\nThis change landed in [Hamlit 2.14.2](https://github.com/k0kubun/hamlit/blob/master/CHANGELOG.md#2142---2021-01-21),\nwhich we confirmed resolves the issue.\n\n## The next steps\n\nIt is exciting to see that the Ruby community is making progress on making Ruby a more memory-efficient\nlanguage, but we learned that we need to step carefully when introducing such wide-reaching changes to a large\napplication like GitLab. It is difficult to investigate and fix problems that crash the Ruby VM, which is more likely for\nany library that uses C-extensions.\n\nTwo particular action items we took away from this were:\n\n1. **More reliable detection of compaction-related issues in CI.** We're not going to sugar-coat this:\n   We detected the problem late. Our comprehensive test suite was passing, our QA and performance tests\n   on staging environments passed, and the problem didn't even show up in canary deployments. Ideally, we\n   would have caught this issue with our automated test suite. One way to test whether compaction causes problems\n   is by using `GC.verify_compaction_references` – this is a rather crude tool because it requires\n   keeping two copies of the Ruby heap, which can be prohibitively expensive in terms of memory use. We\n   have therefore not yet decided how to approach this.\n1. **Improve our ability to roll out system configuration gradually.** Puma is part of our core infrastructure,\n   since it sits in the path of every web request, which makes it especially risky to experiment with Puma\n   configuration. GitLab already supports [feature flags](https://docs.gitlab.com/ee/development/feature_flags/index.html)\n   to allow developers to roll out product changes gradually, but it presents us with a catch-22 when\n   making changes at the infrastructure level, because to query the state of a feature flag, the infrastructure\n   needs to already be up and running. 
It would be ideal to have a similar mechanism for system configuration, [which we are currently exploring](https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/154).\n\nWhile performance is a major focus for us at the moment, it must not compromise availability.\nWe will continue to monitor developments in the Ruby community around compaction support, but decided not to\nuse it in production at this point in time since the gains don't appear to outweigh the risks.\n",[864,9,754],{"slug":1572,"featured":6,"template":688},"puma-nakayoshi-fork-and-compaction","content:en-us:blog:puma-nakayoshi-fork-and-compaction.yml","Puma Nakayoshi Fork And Compaction","en-us/blog/puma-nakayoshi-fork-and-compaction.yml","en-us/blog/puma-nakayoshi-fork-and-compaction",{"_path":1578,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1579,"content":1585,"config":1591,"_id":1593,"_type":13,"title":1594,"_source":15,"_file":1595,"_stem":1596,"_extension":18},"/en-us/blog/rearchitecting-git-object-database-mainentance-for-scale",{"title":1580,"description":1581,"ogTitle":1580,"ogDescription":1581,"noIndex":6,"ogImage":1582,"ogUrl":1583,"ogSiteName":672,"ogType":673,"canonicalUrls":1583,"schema":1584},"Why and how we rearchitected Git object database maintenance for scale","Go in-depth into improvements to maintenance of the Git object database for reduced overhead and increased efficiency.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749664413/Blog/Hero%20Images/speedlights.png","https://about.gitlab.com/blog/rearchitecting-git-object-database-mainentance-for-scale","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why and how we rearchitected Git object database maintenance for scale\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Patrick Steinhardt\"}],\n        \"datePublished\": \"2023-11-02\",\n      }",{"title":1580,"description":1581,"authors":1586,"heroImage":1582,"date":1588,"body":1589,"category":681,"tags":1590},[1587],"Patrick Steinhardt","2023-11-02","\n[Gitaly](/direction/gitaly/#gitaly-1), the service that is responsible for providing access to Git repositories in GitLab, needs to ensure that the repositories are maintained regularly. Regular maintenance ensures:\n\n- fast access to these repositories for users\n- reduced resource usage for servers\n\nHowever, repository maintenance is quite expensive by itself and especially so for large monorepos.\n\nIn [a past blog post](/blog/scaling-repository-maintenance/), we discussed how we revamped the foundations of repository maintenance so that we can iterate on the exact maintenance strategy more readily. 
This blog post will go through improved maintenance strategies for objects hosted in a Git repository, which was enabled by that groundwork.\n\n- [The object database](#the-object-database)\n- [The old way of packing objects](#the-old-way-of-packing-objects)\n- [All-into-one repacks](#all-into-one-repacks)\n- [Deletion of unreachable objects](#deletion-of-unreachable-objects)\n- [Reachability checks](#reachability-checks)\n- [The new way of packing objects](#the-new-way-of-packing-objects)\n- [Cruft packs](#cruft-packs)\n- [More efficient incremental repacks](#more-efficient-incremental-repacks)\n- [Geometric repacking](#geometric-repacking)\n- [Real-world results](#real-world-results)\n\n## The object database\n\nWhenever a user makes changes in a Git repository, these changes come in the form of new objects written into the repository. Typically, any such object is written into the repository as a so-called \"loose object,\" which is a separate file that contains the compressed contents of the object itself with a header that identifies the type of the object.\n\nTo demonstrate this, in the following example we use\n[`git-hash-object(1)`](https://www.git-scm.com/docs/git-hash-object) to write a new blob into the repository:\n\n```shell\n $ git init --bare repository.git\nInitialized empty Git repository in /tmp/repository.git/\n $ cd repository.git/\n $ echo \"contents\" | git hash-object -w --stdin\n12f00e90b6ef79117ce6e650416b8cf517099b78\n $ tree objects\nobjects\n├── 12\n│   └── f00e90b6ef79117ce6e650416b8cf517099b78\n├── info\n└── pack\n\n4 directories, 1 file\n```\n\nAs you can see, the new object was written into the repository and stored as a separate file in the objects database.\n\nOver time, many of these loose objects will accumulate in the repository. Larger repositories tend to have millions of objects, and storing all of them as separate files is going to be inefficient. To ensure that the repository can be served efficiently to our users and to keep the load on servers low, Git will regularly compress loose objects into packfiles. We can compress loose objects manually by using, for example, [`git-pack-objects(1)`](https://www.git-scm.com/docs/git-pack-objects):\n\n```shell\n $ git pack-objects --pack-loose-unreachable ./objects/pack/pack \u003C/dev/null\nEnumerating objects: 1, done.\nCounting objects: 100% (1/1), done.\nWriting objects: 100% (1/1), done.\nTotal 1 (delta 0), reused 0 (delta 0), pack-reused 0\n7ce39d49d7ddbbbbea66ac3d5134e6089210feef\n $ tree objects\n objects/\n├── 12\n│   └── f00e90b6ef79117ce6e650416b8cf517099b78\n├── info\n│   └── packs\n└── pack\n    ├── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.idx\n    └── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.pack\n```\n\nThe loose object was compressed into a packfile (`.pack`) with a packfile index (`.idx`) that is used to efficiently access objects in that packfile.\n\nHowever, the loose object still exists. To remove it, we can execute [`git-prune-packed(1)`](https://www.git-scm.com/docs/git-prune-packed) to delete all objects that have been packed already:\n\n```shell\n $ git prune-packed\n $ tree objects/\nobjects/\n├── info\n│   └── packs\n└── pack\n    ├── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.idx\n    └── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.pack\n```\n\nFor end users of Git, all of this happens automatically because Git calls `git gc --auto` regularly. 
This command uses heuristics to figure out what needs to be optimized and whether loose objects need to be compressed into packfiles. This command is unsuitable for the server side because:\n\n- The command does not scale well enough in its current form. The Git project must be more conservative about changing defaults because they support a lot of different use cases. Because we know about the specific needs that we have at GitLab, we can adopt new features that allow for more efficient maintenance more readily.\n- The command does not provide an easy way to observe what exactly it is doing, so we cannot provide meaningful metrics.\n- The command does not allow us to fully control all its exact inner workings and so is not flexible enough.\n\nTherefore, Gitaly uses its own maintenance strategy to maintain Git repositories, of which maintaining the object database is one part.\n\n## The old way of packing objects\n\nAny maintenance strategy to pack objects must ensure the following three things to keep a repository efficient and effective with disk space:\n\n- Loose objects must be compressed into packfiles.\n- Packfiles must be merged into larger packfiles.\n- Objects that are not reachable anymore must be deleted eventually.\n\nPrior to GitLab 16.0, Gitaly used the following three heuristics to ensure that those three things happened:\n\n- If the number of packfiles in the repository exceeded a certain threshold, Gitaly rewrote all packfiles into a single new packfile. Any objects that were unreachable were put into loose files so that they could be deleted after a certain grace period.\n- If the number of loose objects exceeded a certain threshold, Gitaly compressed all reachable loose objects into a new packfile.\n- If the number of loose objects that are older than the grace period for object deletion exceeded a certain threshold, Gitaly deleted those objects.\n\nWhile these heuristics satisfy all three requirements, they have several downsides, especially in large monorepos that contain gigabytes of data.\n\n### All-into-one repacks\n\nFirst and foremost, the first heuristic requires us to do all-into-one repacks where all packfiles are regularly compressed into a single packfile. In Git repositories with high activity levels, we usually create lots of packfiles during normal operations. But because we need to limit the maximum number of packfiles in a repository, we need to regularly do these complete rewrites of all objects.\n\nUnfortunately, doing such an all-into-one repack can be prohibitively expensive in large monorepos. The repacks may allocate large amounts of memory and typically keep multiple CPU cores busy during the repack, which can take hours to complete.\n\nSo, ideally, we want to avoid these all-into-one repacks as much as possible.\n\n### Deletion of unreachable objects\n\nTo avoid certain race conditions, Gitaly and Git enforce a grace period before an unreachable object is eligible for deletion. This grace period is tracked using the access time of such an unreachable object: If the object was last accessed before the grace period began, the unreachable object can be deleted.\n\nTo track the access time of a single object, the object must exist as a loose object. 
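\n\nPlain Git applies the same rule when pruning. A minimal sketch (the two-week expiry shown here mirrors Gitaly's grace period):\n\n```shell\n# Delete unreachable loose objects, but only those whose modification\n# time is older than two weeks\ngit prune --expire=2.weeks.ago\n```\n\n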
Tracking access times this way means that all objects that are pending deletion are evicted from any packfile they were previously part of and become loose objects.\n\nBecause the grace period we have in place for Gitaly is 14 days, large monorepos tend to accumulate a large number of such loose objects that are pending deletion. This has two effects:\n\n- The number of loose objects overall grows, which makes object lookup less efficient.\n- Loose objects are stored a lot less efficiently than packed objects, which means that the disk space required for the objects that are pending deletion is significantly higher than if those objects were stored in their packed form.\n\nIdeally, we would be able to store unreachable objects in packed format while still being able to store their last access times separately.\n\n### Reachability checks\n\nCompressing loose objects into a new packfile is done by using an incremental repack. Git will compute the reachability of all objects in the repository and then pack all loose objects that are reachable into a new packfile.\n\nTo determine reachability of an object, we have to perform a complete graph walk. Starting at all objects that are directly referenced, we walk down any links that those objects have to any other objects. Once the walk completes, we have split all objects into two sets: the reachable and the unreachable objects.\n\nThis operation can be quite expensive, and the larger the repository and the more objects it contains, the more expensive this computation gets. As mentioned above though, objects which are about to be deleted need to be stored\nas loose objects such that we can track their last access time. So if our incremental repack compressed all loose objects into a packfile regardless of their reachability, then this would impact our ability to track the grace\nperiod per object.\n\nThe ideal solution here would avoid doing reachability checks altogether while still being able to track the grace period of unreachable objects which are pending deletion individually.\n\n## The new way of packing objects\n\nOver the past two years, the Git project has shipped multiple mechanisms that allow us to address all of these pain points we had with our old strategy. These new mechanisms come in two different forms:\n\n- Geometric repacking allows us to merge multiple packfiles without having to rewrite all packfiles into one. This feature was introduced in [Git v2.32.0](https://gitlab.com/gitlab-org/git/-/commit/2744383cbda9bbbe4219bd3532757ae6d28460e1).\n- Cruft packs allow us to store objects that are pending deletion in compressed format in a packfile. This feature was introduced in [Git v2.37.0](https://gitlab.com/gitlab-org/git/-/commit/a50036da1a39806a8ae1aba2e2f2fea6f7fb8e08).\n\nThe Gitaly team has reworked the object database maintenance strategy to make use of these new features.\n\n### Cruft packs\n\nPrior to Git v2.37.0, pruning objects with a grace period required Git to first unpack packed objects into loose objects. We did this so that we could track the per-object access times for unreachable objects that are pending deletion as explained above. This is inefficient, though, as it potentially requires us to keep a lot of unreachable objects in loose format until they can be deleted after the grace period.\n\nWith Git v2.37.0, [git-repack(1)](https://www.git-scm.com/docs/git-repack) learned to write [cruft packs](https://git-scm.com/docs/cruft-packs). 
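\n\nWith a recent enough Git, you can create one yourself. A minimal sketch (the expiry value is arbitrary):\n\n```shell\n# Repack reachable objects, and collect unreachable objects younger than\n# the expiry into a separate cruft pack instead of exploding them into\n# loose files\ngit repack --cruft --cruft-expiration=2.weeks.ago -d\n```\n\n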
While a cruft pack looks just like a normal pack, it also has an accompanying\n`.mtimes` file:\n\n```shell\n$ tree objects/\nobjects/\n├── info\n│   └── packs\n└── pack\n    ├── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.idx\n    ├── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.mtimes\n    └── pack-7ce39d49d7ddbbbbea66ac3d5134e6089210feef.pack\n```\n\nThis file contains per-object timestamps that record when the object was last accessed. With this, we can continue to track per-object grace periods while storing the objects in a more efficient way compared to loose objects.\n\nIn Gitaly, we [started to make use of cruft packs](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5454) in GitLab 15.10 and made the feature generally available in GitLab 15.11. Cruft packs allow us to store objects that are pending deletion more efficiently and with less impact on the overall performance of the repository.\n\n### More efficient incremental repacks\n\nCruft packs also let us drop the reachability checks we had to perform when doing incremental repacks.\n\nPreviously, we had to always ensure reachability when packing loose objects so that we didn't pack objects that were pending deletion. But now that any such object is stored as part of a cruft pack rather than as a loose object, we can instead compress all loose objects into a packfile. This change was [introduced into Gitaly](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5660) with GitLab 16.0.\n\nIn an artificial benchmark with the Linux repository, compressing all loose objects into a packfile led to more than a 70-fold speedup, dropping from almost 13 seconds to 174 milliseconds.\n\n### Geometric repacking\n\nLast but not least, we still have the issue that we need to perform regular all-into-one repacks when we have too many packfiles in the repository.\n\nGit v2.32.0 introduced a new \"geometric\" repacking strategy for the [git-repack(1)](https://www.git-scm.com/docs/git-repack) command that will merge multiple packfiles into a single, larger packfile, which we can use to solve this issue.\n\nThis new \"geometric\" strategy tries to ensure that existing packfiles in the repository form a [geometric sequence](https://en.wikipedia.org/wiki/Geometric_progression) where each successive packfile contains at least `n` times as many objects as the preceding packfile. If the sequence isn't maintained, Git will determine a slice of packfiles that it must repack to restore the sequence. With this process, we can limit the number of packfiles that exist in the repository without having to repack all objects into a single packfile regularly.\n\nThe following figures demonstrate geometric repacking with a factor of two.\n\n1. We notice that the two smallest packfiles do not form a geometric sequence as they both contain two objects each.\n\n![Geometrically repacking packfiles, initial](https://about.gitlab.com/images/blogimages/2023-10-09-repository-scaling-odb-maintenance/geometric-repacking-1.png)\n\n1. We identify the smallest slice of packfiles that need to be repacked in order to restore the geometric sequence. Merging the smallest two packfiles would lead to a packfile with four objects. This would not be sufficient to restore the geometric sequence as the next-biggest packfile contains four objects, as well.\n\nInstead, we need to merge the smallest three packfiles into a new packfile that contains eight objects in total. 
As `8 × 2 ≤ 16`, the geometric sequence is restored.\n\n![Geometrically repacking packfiles, combining](https://about.gitlab.com/images/blogimages/2023-10-09-repository-scaling-odb-maintenance/geometric-repacking-2.png)\n\n1. We merge those packfiles into a new packfile.\n\n![Geometrically repacking packfiles, final](https://about.gitlab.com/images/blogimages/2023-10-09-repository-scaling-odb-maintenance/geometric-repacking-3.png)\n\nOriginally, we introduced this new feature [into Gitaly in GitLab 15.11](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5590).\n\nUnfortunately, we had to quickly revert this new mode. It turned out that the geometric strategy was not ready to handle Git repositories that had an alternate object database connected to them. Because we make use of this feature to [deduplicate objects across forks](https://docs.gitlab.com/ee/development/git_object_deduplication.html), the new repacking strategy led to problems.\n\nAs active contributors to the Git project, we set out to fix these limitations in git-repack(1) itself. This led to an [upstream patch series](http://public-inbox.org/git/a07ed50feeec4bfc3e9736bf493b9876896bcdd2.1680606445.git.ps@pks.im/T/#u) that fixed a bunch of limitations around alternate object directories when doing geometric repacks in Git; those fixes were then released with Git v2.41.\n\nWith these fixes upstream, we were then able to\n[reintroduce the change](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5607) and [globally enable our new geometric repacking strategy](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/5745) with GitLab 16.0.\n\n## Real-world results\n\nAll of this is kind of dry and deeply technical. What about the real-world results?\n\nThe following graphs show the global time we spent repacking objects across all projects hosted on GitLab.com.\n\n![Time spent optimizing repositories globally](https://about.gitlab.com/images/blogimages/2023-10-09-repository-scaling-odb-maintenance/global-optimization.png)\n\nThe initial rollout was on April 26 and progressed until April 28. As you can see, there was first a significant increase in repacking time. But after the initial dust settled, the time we spent repacking repositories globally decreased by almost 20%.\n\nIn the two weeks before we enabled the feature, during weekdays and at peak times we were usually spending around 2.6 days per 12 hours repacking. In the two weeks after the feature was enabled, we spent around 2.12 days per 12 hours\nrepacking objects.\n\nThis is a success by itself already, but the more important question is how it would impact large monorepos, which are significantly harder to keep well-maintained due to their sheer size. Fortunately, the effect of the new housekeeping strategy was a lot more significant here. 
The following graph shows the time we spent performing housekeeping tasks in our own `gitlab-org` and `gitlab-com` groups, which host some of the most active repositories that have caused issues in the past:\n\n![Time spent optimizing repositories in GitLab groups](https://about.gitlab.com/images/blogimages/2023-10-09-repository-scaling-odb-maintenance/gitlab-groups-optimization.png)\n\nIn summary, we have observed the following improvements:\n\n|                                                        | Before              | After                | Change |\n| ------------------------------------------------------ | ------------------- | -------------------- | ------ |\n| Global accumulated repacking time                      | ~5.2 hours/hour     | ~4.2 hours/hour      | -20%   |\n| Large repositories of gitlab-org and gitlab-com groups | ~0.7-1.0 hours/hour | 0.12-0.15 hours/hour | -80%   |\n\nWe have heard of other customers that saw similar improvements in highly active large monorepositories.\n\n## Manually enable geometric repacking\n\nWhile the new geometric repacking strategy has been default-enabled starting with GitLab 16.0, it was introduced with GitLab 15.11. If you want to use the\nnew geometric repacking mode, you can opt in by setting the\n`gitaly_geometric_repacking` feature flag. You can do so via the `gitlab-rails`\nconsole:\n\n```\nFeature.enable(:gitaly_geometric_repacking)\n```\n",[757,864,9,708],{"slug":1592,"featured":6,"template":688},"rearchitecting-git-object-database-mainentance-for-scale","content:en-us:blog:rearchitecting-git-object-database-mainentance-for-scale.yml","Rearchitecting Git Object Database Mainentance For Scale","en-us/blog/rearchitecting-git-object-database-mainentance-for-scale.yml","en-us/blog/rearchitecting-git-object-database-mainentance-for-scale",{"_path":1598,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1599,"content":1605,"config":1610,"_id":1612,"_type":13,"title":1613,"_source":15,"_file":1614,"_stem":1615,"_extension":18},"/en-us/blog/redbox-on-demand-delivers-with-gitlab",{"title":1600,"description":1601,"ogTitle":1600,"ogDescription":1601,"noIndex":6,"ogImage":1602,"ogUrl":1603,"ogSiteName":672,"ogType":673,"canonicalUrls":1603,"schema":1604},"Redbox delivers On Demand with GitLab","Redbox's Joel Vasallo and Nicholas Konieczko explain how they ‘deliver software like a fox’ with GitLab.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749673064/Blog/Hero%20Images/redbox-blog-jannes-glas-unsplash.jpg","https://about.gitlab.com/blog/redbox-on-demand-delivers-with-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Redbox delivers On Demand with GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Brein Matturro\"}],\n        \"datePublished\": \"2019-10-01\",\n      }",{"title":1600,"description":1601,"authors":1606,"heroImage":1602,"date":1607,"body":1608,"category":1507,"tags":1609},[703],"2019-10-01","\nAt GitLab Connect Chicago, Redbox's [Joel Vasallo](https://www.linkedin.com/in/joelvasallo) and [Nicholas Konieczko](https://www.linkedin.com/in/nick-konieczko-42895354) presented a talk called “Delivering software like a fox.” Redbox, primarily known for providing movie and video game rentals via automated retail kiosks, has recently expanded to provide streaming services.\n\nRedbox On Demand is the company's newest streaming platform, built on .NET Core in containers on Linux in the cloud. 
The video retail company had a few goals in mind when building its latest platform. Joel, cloud DevOps manager, and Nicholas, mobile applications manager, share their three main objectives and how GitLab provides the tools that ensure success.\n\n## Goal #1: Modernize software development processes\n\nThe mobile and development teams wanted to be able to create the platform using the latest technology in order to provide the best product for the customer. “[There was] nothing wrong with the way they were done, but in the sense that the world has really come a long way from traditional Windows servers... in a data center running .NET frameworks and stuff like that, we really wanted to empower developers to use containers,” Joel says.\n\n**Outcome**: The mobile and development teams currently use GitLab CI, leveraging Fastlane. The power of GitLab and its ability to work alongside other tools helped to modernize software development processes.\n\n## Goal #2: Speed up delivery to the cloud\n\nProviding an on-demand service means that the application has to actually be ready at the very moment of demand. Being new to the streaming arena, Redbox felt it was important to move to the cloud. “We also wanted to leverage the power of the cloud and have the scaling perspective of the cloud. We wanted to be in the cloud, as everyone wants to be nowadays. We also wanted to ensure that our features go out the door faster because, in the streaming business, it's all about being first to market with your features,” Joel says.\n\n**Outcome**: The teams now use GitLab CI along with Spinnaker. “We wanted to increase software delivery and do what's best for the teams. I don't want to dictate what developers should do in their day-to-day workflow,” Joel says.\n\n## Goal #3: Empower developers to own their applications\n\nThe hope was that a developer would be able to deploy code to production at any time with a single click of a button. Developers would then have the ability to just write the code, and a working tool would be able to pick up the errors. “Code goes out the door based on an approval process. Developers are able to own and operate their code, essentially,” Joel says.\n\n**Outcome**: The objective was achieved, according to Joel. “Ultimately, developers own their own apps. GitLab Enterprise allowed teams to own their own verticals as well as Spinnaker, which allows them to deploy it to whatever cloud provider that they so choose.”\n\nTo learn more about how GitLab helped the mobile and development teams achieve their platform goals (and more), watch the presentation below.\n\n\u003C!-- blank line -->\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube.com/embed/3eG8Muorafo\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\u003C!-- blank line -->\n\n## Key takeaways\n\n### Putting the version in version control\n\n“There was a disparate amount of Git and source control tools. Namely, we had an internal Git server, which... I don't know what it was actually running. We had GitHub.com. We had Team Foundation Server. We had Azure DevOps. So all this stuff... Teams were really all over the place. They all had their source code. Getting access to things was just a nightmare.\n\n“So what did we do? Let's get another version control system into the mix. We need a fifth one. So we picked GitLab. Honestly, GitLab was the most tried and true solution from our perspective. 
It has support for a few things, like on-prem, also in the cloud as well on the .com offering. But, more than that, at the end of the day it let developers control their namespace within a large organization.” – _Joel Vasallo_\n\n### GitLab works for mobile development\n\n“The mobile teams were the first to get to try out GitLab.com. It's simple. It's extremely easy to get started. There's a lot of documentation out there, all the things I love. It's very cost effective. We were able to get a free trial running, get repos open, test out different things, different features, to see if it could work for our teams.\" – _Nick Konieczko_\n\n### Yes, you can use Jenkins too\n\n“This is, honestly, one of the best things about GitLab, is they just want us to be successful. Batteries are included. There's a lot of great tools in there, such as GitLab CI, the GitLab Issue Board... and GitLab's Artifact Repository. It's built into the platform. GitLab's pipelines with the CI/CD process, all of this comes in. But if you don't want to use it, it'll adapt to your business model.\n\n“For example, my team uses Jenkins. We can still use GitLab. There's no blocking event where it says, ‘Oh, you're using Jenkins. You can't talk to us. Error. Blocked.’ No, we use Jira. We type ‘Jira, take us into GitLab’ all the time. We have JFrog Artifactory. We also use Spinnaker for our software delivery. Again, it transforms to what you need as a business, and that's the thing that I really appreciate, being on the DevOps side.” – _Joel Vasallo_\n\nCover image by [Jannes Glas](https://unsplash.com/@jannesglas) on [Unsplash](https://www.unsplash.com)\n{: .note}\n",[1509,9,757,230],{"slug":1611,"featured":6,"template":688},"redbox-on-demand-delivers-with-gitlab","content:en-us:blog:redbox-on-demand-delivers-with-gitlab.yml","Redbox On Demand Delivers With Gitlab","en-us/blog/redbox-on-demand-delivers-with-gitlab.yml","en-us/blog/redbox-on-demand-delivers-with-gitlab",{"_path":1617,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1618,"content":1624,"config":1630,"_id":1632,"_type":13,"title":1633,"_source":15,"_file":1634,"_stem":1635,"_extension":18},"/en-us/blog/scaling-down-how-we-prototyped-an-image-scaler-at-gitlab",{"title":1619,"description":1620,"ogTitle":1619,"ogDescription":1620,"noIndex":6,"ogImage":1621,"ogUrl":1622,"ogSiteName":672,"ogType":673,"canonicalUrls":1622,"schema":1623},"Scaling down: How we shrank image transfers by 93%","Our approach to delivering an image scaling solution to speed up GitLab site rendering","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749664102/Blog/Hero%20Images/gitlab-values-cover.png","https://about.gitlab.com/blog/scaling-down-how-we-prototyped-an-image-scaler-at-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Scaling down: How we shrank image transfers by 93%\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Matthias Käppler\"}],\n        \"datePublished\": \"2020-11-02\",\n      }",{"title":1619,"description":1620,"authors":1625,"heroImage":1621,"date":1626,"body":1627,"category":1628,"tags":1629},[1567],"2020-11-02","\n\n{::options parse_block_html=\"true\" /}\n\n\n\nThe [Memory](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/application_performance/) team recently shipped an improvement to our image delivery functions\nthat drastically reduces the amount of data we serve to clients. 
Learn how we went from knowing nothing about\n[Golang](https://golang.org/) and image scaling to a working on-the-fly image scaling solution built into\n[Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse).\n\n## Introduction\n\nImages are an integral part of GitLab. Whether it is user and project avatars, or images embedded in issues\nand comments, you will rarely load a GitLab page that does not include images in some way, shape, or form.\nWhat you may not be aware of is that despite most of these images appearing fairly small when presented\non the site, until recently we were always serving them in their original size.\nThis meant that if you visited a merge request, all user avatars that appeared merely as thumbnails\nin sidebars or comments would be delivered by the GitLab application in the same size they were uploaded in,\nleaving it to the browser rendering engine to scale them down as necessary. This meant serving\nmegabytes of image data in a single page load, just so the frontend would throw most of it away!\n\nWhile this approach was simple and served us well for a while, it had several major drawbacks:\n\n- **Perceived latency suffers.** The perceived latency is the time that passes between a user\n  requesting content, and that content actually becoming visible or being ready to engage with.\n  If the browser has to download several megabytes of image data, and then furthermore has to\n  scale down those images to fit the cells they are rendered into, the user experience unnecessarily suffers.\n- **Egress traffic cost.** On gitlab.com, we store all images in object storage, specifically GCS\n  (Google Cloud Storage). This means that our Rails app first needs to resolve an image entity to\n  a GCS bucket URL where the binary data resides, and have the client\n  download the image through that endpoint. So for every image served, we cause\n  traffic from GCS to the user that we have to pay for, and the more data we serve, the higher the cost.\n\nWe therefore took on the challenge to both improve rendering performance and reduce traffic costs\nby implementing [an image scaler that would downscale images](https://gitlab.com/groups/gitlab-org/-/epics/3822)\nto a requested size before delivering them to the client.\n\n### Phase 1: Understanding the problem\n\nThe first problem is always: understand the problem! What is the status quo exactly? How does it work?\nWhat is broken about it? What should we focus on?\n\nWe had a pretty good idea of the severity of the problem, since we regularly run performance tests\nthrough [sitespeed.io](https://www.sitespeed.io) that highlight performance problems on our site.\nIt had identified image sizes as one of the most severe issues:\n\n![sitespeed performance test](https://gitlab.com/groups/gitlab-org/-/uploads/a06d8bfde802995c577afca843be7e96/Bildschirmfoto_2020-07-15_um_11.45.44.png)\n\nTo better inform a possible solution, an essential step was to [collect enough data](https://gitlab.com/gitlab-org/gitlab/-/issues/227387)\nto help identify the areas we should focus on. Here are some of the highlights:\n\n- **Most images requested are avatars.** We looked at the distribution of requests for certain types of images.\n  We found that about 70% of them were for avatars, while the remaining 30% accounted for embedded images.\n  This suggested that any solution would have the biggest reach if we focused on avatars first. 
Within the\n  avatar cohort we found that about 62% are user avatars, 22% are project avatars, and 16% are group avatars,\n  which isn't surprising.\n- **Most avatars requested are PNGs or JPEGs.** We also looked at the distribution of image formats. This is partially\n  affected by our upload pipeline and how images are processed (for instance, we always crop user avatars and store them as PNGs)\n  but we were still surprised to see that both formats made up 99% of our avatars (PNGs 76%, JPEGs 23%). Not much\n  love for GIFs here!\n- **We serve 6GB of avatars in a typical hour.** Looking at a representative window of 1 hour of GitLab traffic, we saw\n  almost 6GB of data move over the wire, or 144GB a day. Based on experiments with downscaling a representative user avatar,\n  we estimated that we could reduce this to a mere 13GB a day on average, saving 130GB of bandwidth each day!\n\nThis was proof enough for us that there were significant gains to be made here. Our first intuition was: could this\nbe done by a CDN? Some modern CDNs like Cloudflare [already support image resizing](https://support.cloudflare.com/hc/en-us/articles/360028146432-Understanding-Cloudflare-Image-Resizing)\nin some of their plans. However, we had two major concerns about this:\n\n1. **Supporting our self-managed customers.** While gitlab.com is the largest GitLab deployment we know of, we have hundreds of thousands\n  of customers who run their own GitLab installation. If we were to only resize images that pass through a CDN in front of gitlab.com,\n  none of those customers would benefit from it.\n1. **Pricing woes.** While there are request budgets based on your CDN plan, we were worried about the operational cost this would\n  add for us and how to reliably predict it.\n\nWe therefore decided to look for a solution that would work for all GitLab users, and that would be more under\nour own control, which led us to phase 2: experimentation!\n\n### Phase 2: Experiments, experiments, experiments!\n\nA frequent challenge for [our team (Memory)](https://about.gitlab.com/handbook/engineering/development/enablement/data_stores/application_performance/)\nis that we need to venture into parts of GitLab's code base\nthat we are unfamiliar with, be it with the technology, the product area, or both. This was true in this\ncase as well. While some of us had some exposure to image scaling services, none of us had ever built or\nintegrated one.\n\nOur main goal in phase 2 was therefore to identify what the possible approaches to image scaling were,\nexplore them by researching existing solutions or even building proofs of concept (POCs), and grade\nthem based on our findings. The questions we asked ourselves along the way were:\n\n- **When should we scale?** Upfront during upload or on-the-fly when an image is requested?\n- **Who does the work?** Will it be a dedicated service? Can it happen asynchronously in Sidekiq?\n- **How complex is it?** Whether it's an existing service we integrate, or something we build ourselves,\n  does implementation or integration complexity justify its relatively simple function?\n- **How fast is it?** We shouldn't forget that we set out to solve a performance issue. Are we sure that\n  we are not making the server slower by the same amount of time we save in the client?\n\nWith this in mind, we identified [multiple architectural approaches](https://gitlab.com/groups/gitlab-org/-/epics/3979) to consider,\neach with its own pros and cons. 
These issues also doubled as a form of [architectural decision log](https://github.com/joelparkerhenderson/architecture_decision_record#what-is-an-architecture-decision-record)\nso that decisions for or against an approach are recorded.\n\nThe major approaches we considered are outlined next.\n\n#### Static vs. dynamic scaling\n\nThere are two basic ways in which an image scaler can operate: it can either create thumbnails of\nan existing image ahead of time, e.g., during the original upload as a background job, or it can\nperform that work on demand, every time an image is requested. To make a long story short: while\nit took a lot of back and forth, and even though we had [a working POC](https://gitlab.com/gitlab-org/gitlab/-/issues/232616),\nwe eventually discarded the idea of scaling statically, at\nleast for avatars. Even though [CarrierWave](https://github.com/carrierwaveuploader/carrierwave) (the Ruby uploader\nwe employ) has an integration\nwith MiniMagick and is able to perform that kind of work, it suffered from several issues:\n\n1. **Maintenance heavy.** Since image sizes may change over time, a strategy is needed to backfill sizes\n  that haven't been computed yet. This raised questions especially for self-managed customers where\n  we do not control the GitLab installation.\n1. **Statefulness.** Since thumbnails are created alongside the original image, it was unclear how to perform\n  cleanups should they become necessary, since CarrierWave does not store these as separate database\n  entities that we could easily query.\n1. **Complexity.** The POC we created turned out to be more complex than anticipated and felt like we\n  were shoehorning this feature onto existing code. This was exacerbated by the fact that at the time\n  we were running a very old version of CarrierWave that was already a maintenance liability, and upgrading it\n  would have added scope creep and delays to an already complex issue.\n1. **Flexibility.** The actual scaler implementation in CarrierWave is buried three layers down the Ruby dependency stack,\n  and it was difficult to replace the actual scaler binary (which would become a\n  problem when trying to secure this solution, as we will see in a moment).\n\nFor these reasons we decided to scale images on-the-fly instead.\n\n#### Dynamic scaling: Workhorse vs. dedicated proxy\n\nWhen scaling images on-the-fly the question becomes: where? 
Early on there was a suggestion to use\n[imgproxy](https://github.com/imgproxy/imgproxy), a \"fast and secure standalone server for resizing and converting remote images\".\nThis sounded tempting, since it is a \"batteries included\" offering, it's free to use, and it is a great\nway to isolate the task of image scaling from other production workloads, which has benefits around\nsecurity and fault isolation.\n\nThe main problem with imgproxy was exactly that, however: a standalone server.\n[Introducing a new service to GitLab](https://docs.gitlab.com/ee/development/adding_service_component.html#adding-a-new-service-component-to-gitlab)\nis a complex task, since we strive to appear as a [single application](https://about.gitlab.com/handbook/product/single-application/) to the end user,\nand documenting, packaging, configuring, running and monitoring a new service just for rescaling images seemed excessive.\nIt therefore wasn't in line with our principle of focusing on the\n[minimum viable change](https://handbook.gitlab.com/handbook/product/product-principles/#the-minimal-viable-change-mvc).\nMoreover, imgproxy had significant overlap with existing architectural components at GitLab, since we already\nrun a reverse proxy: [Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse).\n\nWe therefore decided that the fastest way to deliver an MVC was to build out this functionality in Workhorse\nitself. Fortunately we found that we already had an established pattern for dealing with special, performance-sensitive\nworkloads, which meant that we could\nlearn from existing solutions for similar problems (such as image delivery from remote storage), and we could\nlean on its existing integration with the Rails application for request authentication and running business\nlogic such as validating user inputs, which helped us tremendously to focus on the actual problem: scaling images.\n\nThere was a final decision to make, however: scaling images is a very different kind of workload from\nserving ordinary requests, so an open question was how to integrate a scaler into Workhorse in a way\nthat would not have knock-on effects on other tasks Workhorse processes need to execute.\nThe two competing approaches discussed were to either shell out to an executable that performs the scaling,\nor run a [sidecar process](https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar)\nthat would take over image scaling workloads from the main Workhorse process.\n\n#### Dynamic scaling: Sidecar vs. fork-on-request\n\nThe main benefit of a sidecar process is that it has its own life-cycle and memory space, so it can be tuned\nseparately from the main serving process, which improves fault isolation. Moreover, you only pay the\ncost for starting the process once. However, it also comes with\nadditional overhead: if the sidecar dies, something has to restart it, so we would have to look at\nprocess supervisors such as `runit` to do this for us, which again comes with a significant amount\nof configuration overhead. Since at this point we weren't even sure how costly it would be to serve\nimage scaling requests, we let our MVC principle guide us and decided to first explore the simpler\nfork-on-request approach, which meant shelling out to a dedicated scaler binary on each image scaling\nrequest, and to only consider a sidecar as a possible future iteration.\n\n
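To make the fork-on-request idea concrete, here is a minimal sketch in Ruby (Workhorse itself is written in Go, and the `image-scaler` binary and its flags are hypothetical placeholders, not the invocation we actually use):\n\n```ruby\nrequire \"open3\"\n\n# Fork-on-request: spawn a scaler process per request and stream its\n# output straight back to the client.\ndef serve_scaled_image(response, image_path, width)\n  Open3.popen2(\"image-scaler\", \"--width\", width.to_s, image_path) do |_stdin, stdout, wait_thread|\n    IO.copy_stream(stdout, response) # scaled bytes go directly to the response\n    raise \"scaler failed: #{wait_thread.value.exitstatus}\" unless wait_thread.value.success?\n  end\nend\n```\n\nThe appeal is that every request is isolated in its own short-lived process; the price is paying process startup on each request, which is exactly the trade-off weighed above.\n\n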
Forking on request was [explored as a POC](https://gitlab.com/gitlab-org/gitlab/-/issues/230519)\nfirst, and was quickly made production ready and deployed\nbehind a feature toggle. We initially settled on [GraphicsMagick](http://www.graphicsmagick.org/)\nand its `gm` binary to perform the actual image scaling for us, both because it is a battle-tested library and\nbecause there was precedent at GitLab to use it for existing features, which allowed us to ship\na solution even faster.\n\nThe overall request flow finally looked as follows:\n\n```mermaid\nsequenceDiagram\n    Client->>+Workhorse: GET /image?width=64\n    Workhorse->>+Rails: forward request\n    Rails->>+Rails: validate request\n    Rails->>+Rails: resolve image location\n    Rails-->>-Workhorse: Gitlab-Workhorse-Send-Data: send-scaled-image\n    Workhorse->>+Workhorse: invoke image scaler\n    Workhorse-->>-Client: 200 OK\n```\n\nThe \"secret sauce\" here is the `Gitlab-Workhorse-Send-Data` header synthesized by Rails. It carries\nall necessary parameters for Workhorse to act on the image, so that we can maintain a clean separation\nbetween application logic (Rails) and serving logic (Workhorse).\nWe were fairly happy with this solution in terms of simplicity and ease of maintenance, but we\nstill had to verify whether it met our expectations for performance and security.\n\n### Phase 3: Measuring and securing the solution\n\nDuring the entire development cycle, we frequently measured the performance of the various approaches\nwe tested, so as to understand how they would affect request latency and memory use.\nFor latency tests we relied on [Apache Bench](https://httpd.apache.org/docs/2.4/programs/ab.html) since,\nrecalling our initial mission, we were mostly interested in reducing the request latency a user might experience.\n\nWe also ran benchmarks encoded as Golang tests that specifically [compared different scaler implementations](https://gitlab.com/ayufan/image-resizing-test)\nand how performance changed with different image formats and image sizes. We learned a lot from these\ntests, especially about where we would typically lose the most time, which often was in encoding/decoding\nan image, and not in resizing an image per se.\n\nWe also took security very seriously from the start. Some image formats such as SVGs are notorious\nfor remote code execution attacks, but there were other concerns such as DoS-ing the service with\ntoo many scaler requests or PNG compression bombs. We therefore\nput very strict requirements in place around what sizes (both in dimensions and in bytes) and\nformats we would accept.\n\nUnfortunately one fairly severe issue remained that turned out to be a deal breaker with our simple\nsolution: `gm` is a complex piece of software, and shelling out to a third-party binary written in C still\nleaves the door open for a number of security issues. The decision was to [sandbox the binary](https://gitlab.com/groups/gitlab-org/-/epics/4373)\ninstead, but this turned out\nto be a lot more difficult than anticipated. 
We evaluated multiple approaches to sandboxing,\nsuch as `setuid`, `chroot` and `nsjail`, as well as building a custom binary on top of [seccomp](https://en.wikipedia.org/wiki/Seccomp).\nHowever, due to performance, complexity or other concerns, we discarded all of them in the end.\nWe eventually decided to sacrifice some performance for the sake of protecting our users as best we could and\nwrote a scaler binary in Golang, based on an existing [imaging](https://github.com/disintegration/imaging)\nlibrary, which had none of these issues.\n\n### Results, conclusion and outlook\n\nIn roughly two months we took an innocent-sounding but in fact complex topic, image scaling, and went\nfrom \"we know nothing about this\" to a fully functional solution that is now running on gitlab.com.\nWe faced many headwinds along the way, in part because we were unfamiliar with both the topic and\nthe technology behind Workhorse (Golang), but also because we underestimated the challenges of delivering\nan image scaler that would be both fast and secure, an often difficult trade-off. A major lesson learned\nfor us is that security cannot be an afterthought; it has to be part of the design from day one and\nmust be part of informing the approach taken.\n\nSo was it a success? Yes! While the feature didn't have as much of an impact on overall perceived client\nlatency as we had hoped, we still dramatically improved a number of metrics. First and foremost, the\ndreaded \"properly size image\" reminder that topped our sitespeed metrics reports is resolved. This is also evident\nin the average image size processed by clients, which for image-heavy pages fell off a cliff (that's good -- lower is\nbetter here):\n\n![image size metric](https://gitlab.com/groups/gitlab-org/-/uploads/b453aedaf2132db1292898508fd6a0c1/Bildschirmfoto_2020-10-06_um_07.02.56.png)\n\nSite-wide we saw a staggering **93% reduction** in image transfer size of page content delivered to clients.\nThese gains also translate into savings for GCS egress traffic, and hence dollar cost savings, by an equivalent amount.\n\nA feature is never done of course, and there are a number of things we are looking to improve in the future:\n\n- Improving metrics and observability\n- Improving performance through more aggressive caching\n- Adding support for WebP and other features such as image blurring\n- Supporting content images embedded into GitLab issues and comments\n\nThe Memory team meanwhile will slowly step back from this work, however, and hand it over to product teams\nas product requirements evolve.\n","unfiltered",[732,1249,9],{"slug":1631,"featured":6,"template":688},"scaling-down-how-we-prototyped-an-image-scaler-at-gitlab","content:en-us:blog:scaling-down-how-we-prototyped-an-image-scaler-at-gitlab.yml","Scaling Down How We Prototyped An Image Scaler At Gitlab","en-us/blog/scaling-down-how-we-prototyped-an-image-scaler-at-gitlab.yml","en-us/blog/scaling-down-how-we-prototyped-an-image-scaler-at-gitlab",{"_path":1637,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1638,"content":1644,"config":1650,"_id":1652,"_type":13,"title":1653,"_source":15,"_file":1654,"_stem":1655,"_extension":18},"/en-us/blog/scaling-our-use-of-sidekiq",{"title":1639,"description":1640,"ogTitle":1639,"ogDescription":1640,"noIndex":6,"ogImage":1641,"ogUrl":1642,"ogSiteName":672,"ogType":673,"canonicalUrls":1642,"schema":1643},"How we scaled async workload processing at GitLab.com using Sidekiq","Sidekiq was a great tool for async processing until it couldn't 
keep up. Here's how we made it scale.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667068/Blog/Hero%20Images/sidekiqmountain.jpg","https://about.gitlab.com/blog/scaling-our-use-of-sidekiq","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we scaled async workload processing at GitLab.com using Sidekiq\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Rachel Nienaber\"}],\n        \"datePublished\": \"2020-06-24\",\n      }",{"title":1639,"description":1640,"authors":1645,"heroImage":1641,"date":1647,"body":1648,"category":681,"tags":1649},[1646],"Rachel Nienaber","2020-06-24","## Sidekiq at GitLab\n\nGitLab is a Ruby-on-Rails application that processes a lot of data. Much of this processing can be done asynchronously,\nand one of the solutions we use to accomplish this is [Sidekiq](https://github.com/mperham/sidekiq/wiki), which is a background-processing\nframework for Ruby. It handles jobs that are better processed asynchronously outside the web request/response cycle.\n\nThere are a few terms that we'll use in this post:\n\n* A **worker class** is a class defined in our application to process a task in Sidekiq.\n* A **job** is an instance of a worker class, so each job represents a single task.\n* A **queue** is a collection of jobs (potentially for different worker classes) that are waiting to be processed.\n* A **worker thread** is a thread processing jobs in particular queues. Each Sidekiq process can have multiple worker threads.\n\nThen there are two terms specific to GitLab.com:\n\n* A **Sidekiq role** is a configuration for a particular group of queues. For instance, we might have a `push_actions` role that is for processing the `post_receive` and `process_commit` queues.\n* A **Sidekiq node** is an instance of the GitLab application for a Sidekiq role. A Sidekiq node can have multiple Sidekiq processes.\n\nBack in 2013, in version 6.3 of GitLab, every Sidekiq worker class had its own queue. We weren't strict in monitoring the creation of\nnew worker classes. There was no strategic plan for where each queue would execute.\n\nIn 2016, we tried to introduce order again, and rearranged the queues to be based on features. We followed this with a change in\n2017 to have a dedicated queue for each worker class again, and we were able to monitor queues more accurately and impose specific\nthrottles and limits on each. It was easy to quickly make decisions about the queues as they were running because of how\nthe work was distributed. The queues were grouped, and the names of these groups were, for example, `realtime`, `asap`, and `besteffort`.\n\nAt the time, we knew that this was not the approach recommended by the author of Sidekiq, Mike Perham, but we felt that we knew what\nthe trade-offs were. In fact, Mike wrote: \n\n> “I don't recommend having more than a handful of queues. Lots of queues makes for a more complex\n> system [and Sidekiq Pro cannot reliably](https://github.com/antirez/redis/issues/1785) handle multiple queues without\npolling. 
M Sidekiq Pro processes polling N queues means O(M*N) operations per second slamming Redis.”\n\nFrom [https://github.com/mperham/sidekiq/wiki/Advanced-Options#queues](https://github.com/mperham/sidekiq/wiki/Advanced-Options#queues)\n\nThis served us well for nearly two years before this approach no longer matched our scaling needs.\n\n### Pressure from availability issues\n\nIn mid-2019 GitLab.com experienced several major incidents related to the way we\nprocess background queues.\n\nExamples of these incidents:\n- [Gitaly n+1 calls caused bad latency and resulted in the Sidekiq queues growing](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7479).\nThis was due to the way we processed tags in Gitaly.\n- A user generated many notes on a single commit which [slowed down the new_note Sidekiq queue](https://gitlab.com/gitlab-com/gl-infra/production/issues/1028)\nand led to a delay in sending out notifications.\n- CI jobs took a very long time to complete because [jobs in the pipeline_processing:pipeline_process Sidekiq queue piled up](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7402).\nTwo pipelines caused a large number of Sidekiq jobs, Sidekiq pipeline nodes were maxing out their CPU, pipeline_processing\njobs were causing many SQL calls, and the pgbouncer pool for Sidekiq was becoming saturated.\n\nAll of these showed that we needed to take action.\n\n![Sidekiq throughput per job](https://about.gitlab.com/images/blogimages/sidekiq_throughput_per_job.png){: .shadow}\n\nThis image shows how many jobs we process per second over a 24-hour period. It shows the variety of jobs and\ngives an idea of the scale of jobs in relation to each other.\n\n### Improvements\n\n#### Changing the relationship between jobs and Sidekiq roles\n\nIn [infrastructure#7219 (closed)](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219) we significantly\naltered our approach for how jobs were related to Sidekiq roles.\n\nWe started from a position where:\n1. We had a many-to-many relationship between Sidekiq jobs and Sidekiq roles.\n   1. For example, most pipeline jobs ran on the `besteffort` nodes, but some ran on the pipeline nodes.\n   1. Some jobs ran on up to three types of node: e.g., the `realtime`, `asap` and `besteffort` priorities.\n1. Worker threads were reserved for single queues.\n   1. For example, one eighth of the `realtime` queue might be reserved for new_note jobs. In the event of a glut of\n  new_note jobs, most of the fleet would sit idle while one worker thread would be saturated. Worse, adding more nodes would\n  only increase processing power by 1/8th of a node, not the full compute capacity of the new node.\n1. Urgent and non-urgent jobs would be in the same queue.\n   1. For example, some jobs in the `realtime` queue would take up to 10 minutes to process.\n   1. This is a bit like allowing overloaded trolleys in the 10-items-or-less lane.\n\nOnce the issue was completed, we had:\n1. A one-to-one relationship between Sidekiq jobs and Sidekiq roles.\n   1. Each job will execute on exactly one Sidekiq role.\n1. All worker threads run all jobs, and each Sidekiq node has the same number of worker threads.\n   1. When a glut of jobs comes in, 100% of compute on a node can be dedicated to executing the jobs.\n1. Slow jobs and fast jobs are kept apart.\n   1. The 10-items-or-less lane is now being enforced.\n\nWhile this was a significant improvement, it introduced some technical debt. 
We fixed everything for a moment in time,\nknowing that as soon as the application changed this would be out of date, and as time went on, it would only get more out\nof date until we were back in the same position. To try to mitigate this in the future, we started looking at classifying\nthe workloads and using queue selectors.\n\n#### Queue selectors deployed in Sidekiq Cluster\n\nIn the\n[Background Processing Improvements Epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/96), we looked at ways\nthat we could simplify the structure so that background processing could be in a position to scale to 100x the traffic\nat the time. We also needed the processing to be unsurprising. Operators (and developers) should understand where a job\nwill run, why it is queueing up, and how to reduce queues. We decided to move to using [queue selectors](https://docs.gitlab.com/ee/administration/sidekiq/extra_sidekiq_processes.html)\nto help us to keep the queue definitions correct. (This approach is still experimental.)\n\nIn addition, the infrastructure team should not reactively (and manually) route Sidekiq jobs to priority fleets, as\nwas the situation previously. Developers should have the ability to specify the requirements of their workloads and\nhave these automatically processed on a queue designed to support that type of work.\n\nSidekiq processes can be configured to select specific queues for processing. Instead of making this selection by name,\nwe wanted to make the selection based on how the workload for that queue was classified.\n\nWe came up with an approach for classifying background jobs by their workload and building a sustainable way of grouping\nsimilar workloads together.\n\nWhen a new worker class is created, developers need to classify its workload. This is done by:\n- Specifying the [urgency of the job](https://docs.gitlab.com/ee/development/sidekiq/index.html). The options\nare `high`, `low` and `none`. If the delay of a job would have user impact, then the job is `high` urgency.\n- Noting if the [job has external dependencies](https://docs.gitlab.com/ee/development/sidekiq/index.html)\nthat could impact their availability. (For example, if they communicate with user-specified Kubernetes clusters.)\n- Adding an [annotation declaring if the worker class will be cpu-bound or memory-bound](https://docs.gitlab.com/ee/development/sidekiq/index.html). Knowing\nthis allows us to make decisions around how much thread concurrency a Ruby process can tolerate, or targeting memory-bound\njobs to low-concurrency, high-memory nodes.\n\nThere is additional guidance available to [determine if the worker class should be marked as cpu-bound](https://docs.gitlab.com/ee/development/sidekiq/index.html).\n\n
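Concretely, this classification is declared on the worker class itself. A rough sketch of such a declaration, using the attribute methods described in GitLab's Sidekiq development docs (the worker itself is hypothetical, and the exact API may have evolved since this post):\n\n```ruby\n# Hypothetical worker showing how a workload is classified.\nclass ExampleNotificationWorker\n  include ApplicationWorker\n\n  urgency :high                  # delays would be visible to users\n  worker_resource_boundary :cpu  # route to CPU-oriented nodes\n  # Declare this if the job talks to systems outside our control,\n  # such as user-specified Kubernetes clusters:\n  # worker_has_external_dependencies!\n\n  def perform(notification_id)\n    # deliver the notification...\n  end\nend\n```\n\n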
#### SLAs are based on these attributes\n\n1. High urgency jobs should not queue for more than 10 seconds.\n1. High urgency jobs should not take more than 10 seconds to execute (this SLA is the responsibility of the owning team, which needs to ensure that high throughput is maintained).\n1. Low urgency jobs should not queue for more than 1 minute.\n1. Jobs without urgency have no queue SLA.\n1. Non-high urgency jobs should not take more than 5 minutes to execute.\n\nIn each case, the queuing SLAs are the responsibility of the infrastructure team, as they need to ensure that the fleet is\ncorrectly provisioned to meet the SLA.\n\nThe execution latency SLAs are the responsibility of the development team owning the worker class, as they need to ensure that the\nworker class is sufficiently performant to ensure throughput.\n\n![Sidekiq certain queues spike](https://about.gitlab.com/images/blogimages/sidekiq_authorized_projects.png){: .shadow}\n\nThis image shows the challenges we faced by having jobs of different urgency running on the same queue.\nThe purple lines show spikes from one particular worker, where many jobs were added to the queue,\ncausing delays to other jobs which were often of equal or higher importance.\n\n### Challenge during rollout - BRPOP\n\nAs the number of background queues in the GitLab application grows, this approach continues to burden our Sidekiq Redis\nservers. On GitLab.com, our `catchall` Sidekiq nodes monitor about 200 queues, and the Redis [BRPOP](https://redis.io/commands/brpop)\ncommands used to monitor the queues consume a significant amount of time (by Redis latency standards).\n\nThe number of clients listening made this problem worse. For `besteffort` we had 7 nodes, each running 8 processes,\nwith 15 threads watching those queues - meaning 840 clients.\n\nThe command causing the problem was BRPOP. The time taken to perform this command also relates\nto the number of listeners on those keys. The addition of multiple keys increases contention in the system, which causes\nlots of connections to block, and the longer the key list, the worse the problem gets. The key list represents the number of\nqueues: the more queues we have, the more keys we are listening to. We saw this problem on the nodes that process the most queues.\n\nWe raised an issue in the Redis issue tracker about the [performance we observed when many clients performed BRPOP on the\nsame key](https://github.com/antirez/redis/issues/7071). It was fantastic when [Salvatore](https://github.com/antirez)\nresponded within the hour and the patch was available the same day! This fix was made in Redis 6 and backported to Redis 5.\n[Omnibus has also been upgraded to use this fix](https://gitlab.com/gitlab-org/omnibus-gitlab/-/merge_requests/4126), and it will\nbe available in the major release 13.0.\n\n### Current state (as of June 2020)\n\nMigrating to these new selectors was completed in late April 2020.\n\nWe reduced our Sidekiq fleet from 49 nodes with 314 CPUs to 26 nodes with 158 CPUs. This has also reduced our costs.\nThe average utilization is more evenly spread across the new fleets.\n\nAlso, we have [moved Sidekiq-cluster to Core](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/181). Previously, running\nSidekiq in clustered mode (i.e. spawning more than one process) was\ntechnically only available as part of GitLab EE distributions, and for self-managed environments only in the Starter+ tiers.\nBecause of that, when booting Sidekiq up in a development environment with the GDK, the least common denominator was assumed,\nwhich was to run Sidekiq in a single-process setup. That can be a problem, because it means there is a divergence between\nthe environment developers work on, and what will actually run in production (i.e. 
gitlab.com and higher-tier self-managed environments).\n\nIn [release 13.1](/releases/2020/06/22/gitlab-13-1-released/) Sidekiq Cluster is used by default.\n\nWe're also better placed to migrate to Kubernetes. The selector approach is a lot more compatible with making good\ndecisions about things like CPU allocations and limits for Kubernetes workloads, and this will make the job of our delivery\nteam easier, leading to further cost reductions from auto-scaling deployed resources to match actual load.\n\nOur next piece of work with Sidekiq will be to [reduce the number of queues that we need to watch](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/194)\nand we will post a follow-up to this blog post when the work is completed.\n\n**Read more about infrastructure issues:**\n\n[Faster pipelines with DAG](/blog/directed-acyclic-graph/)\n\n[Keep Kubernetes runners moving](/blog/best-practices-for-kubernetes-runners/)\n\n[Understand parent-child pipelines](/blog/parent-child-pipelines/)\n\nCover image by [Jerry Zhang](https://unsplash.com/@z734923105) on [Unsplash](https://www.unsplash.com)\n{: .note}\n",[754,9,732],{"slug":1651,"featured":6,"template":688},"scaling-our-use-of-sidekiq","content:en-us:blog:scaling-our-use-of-sidekiq.yml","Scaling Our Use Of Sidekiq","en-us/blog/scaling-our-use-of-sidekiq.yml","en-us/blog/scaling-our-use-of-sidekiq",{"_path":1657,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1658,"content":1664,"config":1669,"_id":1671,"_type":13,"title":1672,"_source":15,"_file":1673,"_stem":1674,"_extension":18},"/en-us/blog/scaling-repository-maintenance",{"title":1659,"description":1660,"ogTitle":1659,"ogDescription":1660,"noIndex":6,"ogImage":1661,"ogUrl":1662,"ogSiteName":672,"ogType":673,"canonicalUrls":1662,"schema":1663},"Future-proofing Git repository maintenance","Learn how we revamped our architecture for faster iteration and more efficiently maintained repositories.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749677736/Blog/Hero%20Images/Git.png","https://about.gitlab.com/blog/scaling-repository-maintenance","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Future-proofing Git repository maintenance\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Patrick Steinhardt\"}],\n        \"datePublished\": \"2023-03-20\",\n      }",{"title":1659,"description":1660,"authors":1665,"heroImage":1661,"date":1666,"body":1667,"category":681,"tags":1668},[1587],"2023-03-20","\n\nUsers get the most from [Gitaly](/direction/gitaly/#gitaly-1), the service responsible for the storage and maintenance of all Git repositories in GitLab, when traffic hitting it is efficiently handled. Therefore, we must ensure our Git repositories remain in a well-optimized state. When it comes to Git monorepositories, this maintenance can be a complex task that causes significant overhead by itself, because repository housekeeping becomes more expensive the larger a repository gets. This blog post explains in depth what we have done over the past few GitLab releases to rework our approach to repository housekeeping so that it scales better and keeps repositories optimized, delivering the best performance for our users.\n\n## The challenge with Git monorepository maintenance\n\nTo ensure that Git repositories remain performant, Git regularly runs a set of\nmaintenance tasks. 
On the client side, this usually happens by automatically\nrunning `git-gc(1)` periodically, which:\n\n- Compresses references into a `packed-refs` file.\n- Compresses objects into `packfiles`.\n- Prunes objects that aren't reachable by any of the revisions and that have\n  not been used for a while.\n- Generates and updates data structures like `commit-graphs` that help to speed\n  up queries against the Git repository.\n\nGit periodically runs `git gc --auto` automatically in the background, which\nanalyzes your repository and only performs maintenance tasks if required.\n\nAt GitLab, we can't use this infrastructure because it does not give us enough\ncontrol over which maintenance tasks are executed at what point in time.\nFurthermore, it does not give us full control over exactly which data\nstructures we opt in to. Instead, we have implemented our own maintenance\nstrategies that are specific to how GitLab works and catered to our specific\nneeds. Unfortunately, the way GitLab implemented repository maintenance has\nbeen limiting us for quite a while now:\n\n- It is unsuitable for large monorepositories.\n- It does not give us the ability to easily iterate on the employed maintenance\n  strategy.\n\nThis post explains our previous maintenance strategy and its problems as well as\nhow we revamped the architecture to allow us to iterate faster and more\nefficiently maintain repositories.\n\n## Our previous repository maintenance strategy\n\nIn the early days of GitLab, most of the application ran on a single server.\nOn this single server, GitLab directly accessed Git repositories. For various\nreasons, this architecture limited us, so we created the standalone Gitaly\nserver that provides a gRPC API to access Git repositories.\n\nTo migrate to exclusively accessing Git repository data using Gitaly we:\n\n- Migrated all the logic that was previously contained in the Rails\n   application to Gitaly.\n- Created Gitaly RPCs and updated Rails to not execute the logic directly, but\n   instead invoke the newly-implemented RPC.\n\nWhile this was the easiest way to tackle the huge task back then, the end\nresult was that there were still quite a few areas in the Rails codebase that\nrelied on knowing how the Git repositories were stored on disk.\n\nOne such area was repository maintenance. In an ideal world, the Rails server\nwould not need to know about the on-disk state of a Git repository. Instead,\nthe Rails server would only care about the data it wants to get out of the\nrepository or commit to it. Because of the Gitaly migration path we took,\nthe Rails application was still responsible for executing fine-grained\nrepository maintenance by calling certain RPCs:\n\n- `Cleanup` to delete stale, temporary files that have accumulated\n- `RepackIncremental` and `RepackFull` to either pack all loose objects into a\n  new packfile or alternatively to repack all packfiles into a single one\n- `PackRefs` to compress all references into a single `packed-refs` file\n- `WriteCommitGraph` to update the commit-graph\n- `GarbageCollect` to perform various different tasks\n\nThese low-level details of repository maintenance were being managed by the\nclient. But because clients didn't have any information on the on-disk state of\nthe repository, they could not even determine which of these maintenance tasks\nhad to be executed in the first place. Instead, we had a very simple heuristic:\nEvery few pushes, we ran one of the above RPCs to perform one of the maintenance\ntasks.\n\n
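Schematically, that old heuristic boiled down to a push counter with fixed thresholds, along these lines (the numbers are illustrative stand-ins for the instance-wide housekeeping settings; this is not Gitaly's actual code):\n\n```ruby\n# Illustrative sketch of the old push-count heuristic: every few pushes,\n# fire one of the fine-grained maintenance RPCs, regardless of the\n# repository's actual on-disk state.\ndef housekeeping_task(push_count)\n  if (push_count % 200).zero?\n    :garbage_collect     # GarbageCollect RPC\n  elsif (push_count % 50).zero?\n    :full_repack         # RepackFull RPC\n  elsif (push_count % 10).zero?\n    :incremental_repack  # RepackIncremental RPC\n  end\nend\n```\n\nNothing in this decision looks at the repository itself, which is precisely the problem.\n\n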
While this heuristic worked, it wasn't great for the following reasons:\n\n- Repositories can be modified without using pushes at all. So if users only\n  use the Web IDE to commit to repositories, those repositories may not get repacked at all.\n- Because repository maintenance is controlled by the client, Gitaly can't\n  assume a specific repository state.\n- The threshold for executing housekeeping tasks is set globally across all\n  projects rather than on a per-project basis. Consequently, no matter\n  whether you have a tiny repository or a huge monorepository, we would use the\n  same intervals for executing maintenance tasks. As you may imagine though,\n  doing a full repack of a Git repository that is only a few dozen megabytes in\n  size is a few orders of magnitude faster than repacking a monorepository\n  that is multiple gigabytes in size.\n- Specific types of Git repositories hosted by Gitaly need special care and we\n  required Gitaly clients to know about these.\n- Repository maintenance was inefficient overall. Clients do not know about the\n  on-disk state of repositories. Consequently, they had no choice except to\n  repeatedly ask Gitaly to optimize specific data structures without knowing\n  whether this was required in the first place.\n\n## Heuristical maintenance strategy\n\nIt was clear that we needed to change the strategy we used for repository\nmaintenance. Most importantly, we wanted to:\n\n- Make Gitaly the single source of truth for how we maintain repositories.\n  Clients should not need to worry about low-level specifics, and Gitaly should\n  be able to easily iterate on the strategy.\n- Make the default maintenance strategy work for repositories of all sizes.\n- Make the maintenance strategy work for repositories of all types. A client\n  should not need to worry about which maintenance tasks must be executed for\n  what repository type.\n- Avoid optimizing data structures that already are in an optimal state.\n- Improve visibility into the optimizations we perform.\n\nAs mentioned in the introduction, Git periodically runs `git gc --auto`. This\ncommand inspects the repository's state and performs optimizations only when it\nfinds that the repository is in a sufficiently bad state to warrant the cost.\nWhile using this command directly in the context of Gitaly does not give us\nenough flexibility, it did serve as the inspiration for our new architecture.\n\nInstead of providing fine-grained RPCs to maintain various parts of a Git\nrepository, we now only provide a single RPC `OptimizeRepository` that works as\na black box to the caller. This RPC call:\n\n1. Cleans up stale data in the repository if there is any.\n1. Analyzes the on-disk state of the repository.\n1. Depending on this on-disk state, performs only those maintenance tasks that\n   are deemed necessary.\n\nBecause we can analyze and use the on-disk state of the repository, we can be\nfar more intelligent about repository maintenance compared to the previous\nstrategy where we optimized some bits of the repository every few pushes.\n\n### Packing objects\n\nIn the old-style repository maintenance, the client would call either\n`RepackIncremental` or `RepackFull`. This would either pack all loose objects into a new `packfile`, or repack all objects into a single `packfile`.\n\nBy default, we would perform a full repack every five repacks. While this may be\na good default for small repositories, it gets prohibitively expensive for huge\nmonorepositories where a full repack may easily take several minutes.\n\nThe new heuristical maintenance strategy instead scales the allowed number of\n`packfiles` by the total size of all combined `packfiles`. As a result, the\nlarger the repository becomes, the less frequently we perform a full repack.\n\n
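As a toy illustration of such a size-scaled heuristic (this is not Gitaly's actual formula, just the shape of the idea):\n\n```ruby\n# Toy model: tolerate more packfiles as the repository grows, so that\n# full repacks become rarer for large repositories.\ndef allowed_packfiles(total_pack_size_mib)\n  [Math.log2([total_pack_size_mib, 1].max).floor + 2, 2].max\nend\n\ndef full_repack_needed?(packfile_count, total_pack_size_mib)\n  packfile_count > allowed_packfiles(total_pack_size_mib)\nend\n\nallowed_packfiles(10)     # => 5: a small repository is repacked often\nallowed_packfiles(10_240) # => 15: a multi-gigabyte monorepo far less so\n```\n\n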
### Pruning objects\n\nIn the past, clients would periodically call `GarbageCollect`. In addition to\nrepacking objects, this RPC would also prune any objects that are unreachable\nand that haven't been accessed for a specific grace period.\n\nThe new heuristical maintenance strategy scans through all loose objects that\nexist in the repository. If the number of loose objects that have a modification\ntime older than two weeks exceeds a certain threshold, it spawns the\n`git prune` command to prune these objects.\n\n### Packing references\n\nIn the past, clients would call `PackRefs` to repack references into the\n`packed-refs` file.\n\nBecause the time to compress references scales with the size of the\n`packed-refs` file, the new heuristical maintenance strategy takes into account\nboth the size of the `packed-refs` file and the number of loose references that\nexist in the repository. If a ratio between these two figures is exceeded, we\ncompress the loose references.\n\n### Auxiliary data structures\n\nThere are auxiliary data structures like `commit-graphs` that are used by Git\nto speed up various queries. With the new heuristical maintenance strategy,\nGitaly now automatically updates these as required, either when they are\ndeemed to be out-of-date, or when they are missing altogether.\n\n### Heuristical maintenance strategy rollout\n\nWe rolled out this new heuristical maintenance strategy to GitLab.com in March 2022. Initially, we only rolled it out for\n[`gitlab-org/gitlab`](https://gitlab.com/gitlab-org/gitlab), which is a\nrepository where maintenance performed particularly poorly in the past. You can\nsee the impact of the rollout in the following graph:\n\n![Latency of OptimizeRepository for gitlab-org/gitlab](https://about.gitlab.com/images/blogimages/repo-housekeeping-gitlab-org-gitlab-latency.png)\n\nIn this graph, you can see that:\n\n1. Until March 19, we used the legacy fine-grained RPC calls. We spent most\n   of the time in `RepackFull`, followed by `RepackIncremental` and `GarbageCollect`.\n1. Because March 19 and 20 occurred on a weekend, not much happened with\n   housekeeping.\n1. Early on March 21 we switched `gitlab-org/gitlab` to use heuristical\n   housekeeping using `OptimizeRepository`. Initially, there didn't seem to be\n   much of an improvement: there wasn't much difference in how much time we\n   spent maintaining this repository compared to the past.\n\n   However, this was caused by an inefficient heuristic. Instead of only pruning\n   objects when there were stale ones, we always pruned objects when we saw that\n   there were too many loose objects.\n1. We deployed a fix for this bug on March 22, which led to a significant drop in\n   time spent optimizing this repository compared to before.\n\nThis demonstrated two things:\n\n- We're easily able to iterate on the heuristics that we have in Gitaly.\n- Using the heuristics saves a lot of compute time as we don't unnecessarily\n  optimize anymore.\n\nWe have subsequently rolled this out to all of GitLab.com, starting on March\n29, 2022, with similar improvements. 
With this change, we more than halved the CPU\nload when performing repository optimizations.\n\n## Observability\n\nWhile it is great that `OptimizeRepository` has managed to save us a lot of\ncompute power, another goal was to improve visibility into repository housekeeping.\nMore specifically, we wanted to:\n\n- Gain visibility on the global level to see what optimizations are performed\n  across all of our repositories.\n- Gain visibility on the repository level to know what state a specific\n  repository is in.\n\nIn order to improve global visibility, we expose a set of Prometheus metrics that\nallow us to observe important details about our repository maintenance. The\nfollowing graphs show the optimizations performed in a 30-minute window of our\nproduction systems on GitLab.com.\n\n- The optimizations that are being performed in general.\n\n  ![Repository optimization metrics for GitLab.com](https://about.gitlab.com/images/blogimages/repo-housekeeping-metrics-optimizations.png)\n\n- The average latency it takes to perform each of these optimizations.\n\n  ![Repository optimization metrics for GitLab.com](https://about.gitlab.com/images/blogimages/repo-housekeeping-metrics-latencies.png)\n\n- What kind of stale data we are cleaning up.\n\n  ![Repository optimization metrics for GitLab.com](https://about.gitlab.com/images/blogimages/repo-housekeeping-metrics-cleanups.png)\n\nTo improve visibility into the state each repository is in, we have started to\nlog structured data that includes all the relevant bits. A subset of the\ninformation it exposes is:\n\n- The number of loose objects and their sizes.\n- The number of `packfiles` and their combined size.\n- The number of loose references.\n- The size of the `packed-refs` file.\n- Information about `commit-graphs`, bitmaps and other auxiliary data\n  structures.\n\nThis information is also exposed through Prometheus metrics:\n\n![Repository state metrics for GitLab.com](https://about.gitlab.com/images/blogimages/repo-state-metrics.png)\n\nThese graphs expose important metrics of the on-disk state of our repositories:\n\n- The top panel shows which data structures exist.\n- The heatmaps on the left show how large specific data structures are.\n- The heatmaps on the right show how many of these data structures we have.\n\nCombining both the global and per-repository information allows us to easily\nobserve how repository maintenance behaves during normal operations. But more\nimportantly, it gives us meaningful data when rolling out new features that\nchange the way repositories are maintained.\n\n## Manually enabling heuristical housekeeping\n\nWhile the heuristical housekeeping is enabled by default starting with GitLab\n15.9, it was already introduced in GitLab 14.10. If you want to use the\nnew housekeeping strategy before upgrading to 15.9, you can opt in by\nsetting the `optimized_housekeeping` [feature flag](https://docs.gitlab.com/ee/administration/feature_flags.html#how-to-enable-and-disable-features-behind-flags).\nYou can do so via the `gitlab-rails` console:\n\n```\nFeature.enable(:optimized_housekeeping)\n```\n\n## Future improvements\n\nWhile the new heuristical optimization strategy has been successfully\nbattle-tested for a while now on GitLab.com, at the time of writing this\nblog post it still wasn't enabled by default for self-managed installations.\nThis has finally changed with GitLab 15.8, where we have default-enabled the new\nheuristical maintenance strategy.\n\nWe are not done yet, though. 
Now that Gitaly is the single source of truth for how\nrepositories are optimized, we are tracking improvements to our maintenance\nstrategy in [epic 7443](https://gitlab.com/groups/gitlab-org/-/epics/7443):\n\n- [Multi-pack indices](https://git-scm.com/docs/multi-pack-index) and geometric\n  repacking will help us to further reduce the time spent repacking objects.\n- [Cruft packs](https://git-scm.com/docs/cruft-packs) will help us to further\n  reduce the time spent pruning objects and reduce the overall size of\n  unreachable objects.\n- Gitaly will automatically run housekeeping tasks when receiving mutating RPC\n  calls so that clients don't have to call `OptimizeRepository` at all anymore.\n\nSo stay tuned!\n\n",[757,864,9],{"slug":1670,"featured":6,"template":688},"scaling-repository-maintenance","content:en-us:blog:scaling-repository-maintenance.yml","Scaling Repository Maintenance","en-us/blog/scaling-repository-maintenance.yml","en-us/blog/scaling-repository-maintenance",{"_path":1676,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1677,"content":1682,"config":1688,"_id":1690,"_type":13,"title":1691,"_source":15,"_file":1692,"_stem":1693,"_extension":18},"/en-us/blog/sharing-slis-across-departments",{"title":1678,"description":1679,"ogTitle":1678,"ogDescription":1679,"noIndex":6,"ogImage":876,"ogUrl":1680,"ogSiteName":672,"ogType":673,"canonicalUrls":1680,"schema":1681},"How we share SLIs across engineering departments","The Scalability team engages with the Development department for collaborating on SLIs. The first post in this series explains how we made available information accessible for development groups.","https://about.gitlab.com/blog/sharing-slis-across-departments","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we share SLIs across engineering departments\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Bob Van Landuyt\"}],\n        \"datePublished\": \"2022-03-10\",\n      }",{"title":1678,"description":1679,"authors":1683,"heroImage":876,"date":1685,"body":1686,"category":681,"tags":1687},[1684],"Bob Van Landuyt","2022-03-10","\nAt GitLab everyone can contribute to GitLab.com's availability. We\nmeasure the availability using several Service Level Indicators (SLIs).\nBut it's not always easy to see how the features you're building are\nperforming. GitLab's features are divided amongst development groups,\nand every group has [their own dashboard](https://docs.gitlab.com/ee/development/stage_group_observability/index.html)\ndisplaying an availability score.\n\n![Stage group availability](https://about.gitlab.com/images/blogimages/2022-02-share-infrastructure-slis/2022-02-23-code_review_availability.png)\n\nWhen a group's availability goes below 99.95%, we work with the group\non figuring out why that is and how we can improve the performance or\nreliability of the features that caused their number to drop. 
The\n99.95% service level objective (SLO) is the same target the\ninfrastructure department has set for\n[GitLab.com availability](/handbook/engineering/infrastructure/performance-indicators/#gitlabcom-availability).\n\nBy providing specific data about how features perform on our production systems, it has become easier to recognize when it is important to prioritize performance and availability work.\n\n## Service availability on GitLab.com\n\nOur infrastructure is separated into multiple services, handling\ndifferent kinds of traffic but running the same monolithic Rails\napplication. Not all features have a similar usage pattern. For\nexample, on the service handling web requests for GitLab.com we see a\nlot more requests related to `code_review` or `team_planning` than we\ndo related to `source_code_management`. It's important that we\nlook at these in isolation as well as in the service aggregate.\n\nThere's nobody who knows better how to interpret these numbers in\nfeature aggregations than the people who build these features.\n\nThis number is sourced from the same SLIs that we use to monitor\nGitLab.com's availability. We calculate this by dividing the number of\nsuccessful measurements by the total number of measurements over the\npast 28 days. A measurement could be several things, most commonly a\nrequest handled by our Rails application or a background job.\n\n## Monitoring feature and service availability\n\nFor monitoring GitLab.com we have Grafana dashboards, generated using\n[Grafonnet](https://grafana.github.io/grafonnet-lib/), that show these\nsource metrics in several dimensions. For example, these are error\nrates of our monolithic Rails application, separated by feature:\n\n![Puma SLI by feature](https://about.gitlab.com/images/blogimages/2022-02-share-infrastructure-slis/2022-02-23-puma_sli_per_feature.png)\n\nWe also generate [multiwindow, multi-burn-rate alerts](https://sre.google/workbook/alerting-on-slos/#short_and_long_windows_for_alerting)\nas defined in Google's SRE workbook.\n\n![Puma SLI error rate and requests per second](https://about.gitlab.com/images/blogimages/2022-02-share-infrastructure-slis/2022-02-23-puma_sli.png)\n\nThe red lines represent alerting thresholds for a burn rate. The\nthin threshold means we'll alert if the SLI has spent more than 5%\nof its monthly error budget in the past 6 hours. The thicker\nthreshold means we'll alert when the SLI has not met its SLO for more than\n2% of measurements in the past hour.\n\nBecause both GitLab.com's availability number and the availability\nnumber for development groups are sourced from the same metrics, we\ncan provide similar alerts and graphs tailored to the\ndevelopment groups. Features with a relatively low amount of traffic would not easily show\nproblems in our bigger service aggregations. 
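For intuition, the two alert thresholds can be expressed as a simple burn-rate calculation (a simplified model of the multiwindow alerts described above; the real alerts are generated from Prometheus rules, and the numbers here are only examples):\n\n```ruby\nSLO = 0.9995\nWINDOW_DAYS = 28.0 # availability is computed over a rolling 28 days\n\n# How fast the error budget is burning, relative to a budget-neutral rate.\ndef burn_rate(error_ratio)\n  error_ratio / (1 - SLO)\nend\n\n# Fraction of the 28-day error budget consumed in a window of `window_hours`.\ndef budget_consumed(error_ratio, window_hours)\n  burn_rate(error_ratio) * window_hours / (WINDOW_DAYS * 24)\nend\n\nbudget_consumed(0.003, 6) # => ~0.054: more than 5% of the monthly budget\n                          #    burned in 6 hours, so the first alert fires\nbudget_consumed(0.02, 1)  # => ~0.060: missing the SLO for 2% of requests\n                          #    over an hour trips the second threshold\n```\n\n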
With these group-level alerts and dashboards, we can see those problems\nand put them on the radar of the teams building those features.\n\n## Building and adoption\n\nIn upcoming posts, we will talk about how we built this tooling and how we worked with other teams to have it adopted into the product prioritization process.\n\n## Related content\n\n- [Our project to provide more detailed data on the stage group dashboards](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/664)\n- [Development documentation for how to change dashboard content](https://docs.gitlab.com/ee/development/stage_group_observability/index.html)\n",[864,9,732,987],{"slug":1689,"featured":6,"template":688},"sharing-slis-across-departments","content:en-us:blog:sharing-slis-across-departments.yml","Sharing Slis Across Departments","en-us/blog/sharing-slis-across-departments.yml","en-us/blog/sharing-slis-across-departments",{"_path":1695,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1696,"content":1702,"config":1708,"_id":1710,"_type":13,"title":1711,"_source":15,"_file":1712,"_stem":1713,"_extension":18},"/en-us/blog/the-gitlab-guide-to-modern-software-testing",{"title":1697,"description":1698,"ogTitle":1697,"ogDescription":1698,"noIndex":6,"ogImage":1699,"ogUrl":1700,"ogSiteName":672,"ogType":673,"canonicalUrls":1700,"schema":1701},"The GitLab guide to modern software testing","If test is your DevOps team's Public Enemy No. 1, it's time to rethink your strategy. Here's what you need to know about modern software testing.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668307/Blog/Hero%20Images/test-automation-devops.jpg","https://about.gitlab.com/blog/the-gitlab-guide-to-modern-software-testing","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The GitLab guide to modern software testing\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Valerie Silverthorne\"}],\n        \"datePublished\": \"2022-08-18\",\n      }",{"title":1697,"description":1698,"authors":1703,"heroImage":1699,"date":1705,"body":1706,"category":730,"tags":1707},[1704],"Valerie Silverthorne","2022-08-18","\nWhat's the trickiest part of DevOps? It's software testing, hands down. Year after year, respondents to our [annual DevSecOps surveys](/developer-survey/) have called out testing as the most likely reason for release delays. And that's not all they said: \"Testing takes too long,\" \"There are too many tests,\" \"We need to do more testing,\" “We need more automated testing but don't have time,\" \"Testing happens too late,\" etc.\n\nClearly something this fraught needs all the help it can get, so here is our best advice to get testing \"just right\" in any modern DevOps practice. \n\n## Use the right metrics\n\nAll of the testing in the world doesn't matter if a DevOps team is measuring the wrong things. At GitLab, we use industry-standard metrics, but we look at them a bit differently. When it comes to S1 and S2 bugs, we don’t count the time to close but rather the age of the bugs that remain open. Our reasoning? We want to look forward, but we also don't want to [incentivize closing only newer bugs](/blog/gitlab-top-devops-tooling-metrics-and-targets/). So it's important to make sure DevOps teams are looking at the right metrics, with shared goals in mind.\n\n## Forget flaky\n\nTests are noisy, and they can be flaky, setting off alarms and disrupting developer flow, often for no reason. 
That's at the heart of developer frustration with testing, and one of the biggest problems DevOps teams need to solve. GitLab's Vice President of Quality [Mek Stittri](/company/team/#meks) suggests re-thinking how automated tests are created. Tests need to validate the right things, and that must include looking at how all of the code components work together, not just at individual pieces of code. Finally, it doesn't hurt to [develop a manual testing mindset](/blog/software-test-at-gitlab/).\n\n## Make it modern\n\nIn fact, a manual testing mindset, where test designers create tests that actually mimic what real users do, is a key underpinning of modern software testing in DevOps. Testers need to consider getting certified, embracing new technologies like AI, and, perhaps most importantly, becoming [evangelists for quality](/blog/how-to-leverage-modern-software-testing-skills-in-devops/) on a DevOps team.\n\n## Make automation work harder\n\nSoftware testing may be the most annoying DevOps step, but there's no doubt that automating the process makes everything work more smoothly. Teams with test automation [have fewer complaints about release delays](/blog/want-faster-releases-your-answer-lies-in-automated-software-testing/). And teams that have taken it up a notch and added AI/ML into their test automation process are even more upbeat about testing. After all, bots [don't need to take a lunch break or a vacation](/blog/the-software-testing-life-cycle-in-2021-a-more-upbeat-outlook/). Finally, if automation is well thought out, QA and developers can [actually work together to get code out the door](/blog/what-blocks-faster-code-release/).\n\n## Test for everything\n\nFor all the developer finger-pointing around software testing, it's also clear from our surveys that _more_ testing – of everything – has to happen. When considering how to modernize a software testing strategy, don't forget that \"nice to haves\" like [accessibility testing](/blog/introducing-accessibility-testing-in-gitlab/) aren't actually optional but critical for success.\n\nAnd also don't overlook the potential of newer test techniques like [fuzzing](/blog/why-continuous-fuzzing/), which can work with [Go](/blog/how-to-fuzz-go/), [Rust](/blog/how-to-fuzz-rust-code/), and other languages, and take testing into places other methodologies cannot.\n\n## The bottom line\n\nTesting doesn't have to be the enemy of speedy releases or the object of so much frustration. 
Start fresh with a modern software testing approach and make it easy for teams to get the most out of QA.\n",[1249,707,9],{"slug":1709,"featured":6,"template":688},"the-gitlab-guide-to-modern-software-testing","content:en-us:blog:the-gitlab-guide-to-modern-software-testing.yml","The Gitlab Guide To Modern Software Testing","en-us/blog/the-gitlab-guide-to-modern-software-testing.yml","en-us/blog/the-gitlab-guide-to-modern-software-testing",{"_path":1715,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1716,"content":1722,"config":1728,"_id":1730,"_type":13,"title":1731,"_source":15,"_file":1732,"_stem":1733,"_extension":18},"/en-us/blog/the-gitlab-quarterly-how-our-latest-beta-releases-support-developers",{"title":1717,"description":1718,"ogTitle":1717,"ogDescription":1718,"noIndex":6,"ogImage":1719,"ogUrl":1720,"ogSiteName":672,"ogType":673,"canonicalUrls":1720,"schema":1721},"The GitLab Quarterly: How our latest beta releases support developers","The Value Streams Dashboard and Remote Development provide the capabilities needed to support DevSecOps teams and stay competitive.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668367/Blog/Hero%20Images/innovation-unsplash.jpg","https://about.gitlab.com/blog/the-gitlab-quarterly-how-our-latest-beta-releases-support-developers","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The GitLab Quarterly: How our latest beta releases support developers\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Dave Steer\"}],\n        \"datePublished\": \"2023-01-24\",\n      }",{"title":1717,"description":1718,"authors":1723,"heroImage":1719,"date":1725,"body":1726,"category":730,"tags":1727},[1724],"Dave Steer","2023-01-24","\nIt’s easy to say that 2023 will be the year of innovation, but with the macroeconomic environment requiring an obsessive eye on cost efficiencies, and in some cases, cost-cutting, exactly how are organizations supposed to stay competitive when it comes to software development and delivery? The answer is clear: Stay focused on supporting your developers. Our two new beta releases help you do just that.\n\nThe GitLab Value Streams Dashboard, now available in private beta, ensures that all stakeholders have visibility, early and in real time, into the progress and value delivery metrics associated with software development and delivery. With everyone on the same page, discussions can be had and adjustments made before developers face obstacles or stall out waiting for decision-makers to get up to speed. Developers can also see, at a glance, their impact on the idea-to-customer value chain. The goal: Reduce idle time so that developers can spend more time developing and IT leaders can better unlock their transformation results. Keeping the creativity flowing can boost developer happiness and help provide a glide path for software to make its way into the market and add value. \n\nOur other beta release, GitLab Remote Development, can enable organizations to directly support developers by letting them establish an environment that best suits their needs, including where, when, and how they prefer to work. GitLab Remote Development doesn’t require developers to set up and manage local development environments, which keeps workflow distractions to a minimum. 
Stripping away location, device, and complex toolchain barriers can maximize developer satisfaction, which can lead to increased ingenuity and productivity.\n\nAn overarching aspect of this developer support is that it is available on a single DevSecOps platform so you don’t have to tack on something special to achieve these goals — the tools are all there and ready to be used to create better software faster.\n\nNow, let’s dig deeper into these capabilities and how they will help you support your developers and deliver value to your customers.\n\n## GitLab Value Streams Dashboard\n\nIn many conversations we have with customers, lack of visibility into metrics for software development value streams comes up as a pain point. Value streams – the process from idea to delivering customer value – should be the epicenter for understanding the progress, blockers, timelines, and costs associated with your development projects. Without this insight, innovation with an eye to cost efficiencies is virtually impossible. It is also difficult to properly support developers through fast, informed decision-making if everyone doesn’t have access to the same real-time data. \n\nThe GitLab Value Streams Dashboard gives stakeholders a bird's-eye view of their teams’ software delivery metrics (such as [DORA metrics](https://docs.gitlab.com/ee/user/analytics/dora_metrics.html) and [flow metrics](https://docs.gitlab.com/ee/user/analytics/value_stream_analytics.html)) for continuous improvement. DevSecOps teams can identify and fix inefficiencies and bottlenecks in their software delivery workflows, which can improve the overall productivity and stability of their development environment. \n\n> \"Our team is excited to try out the DORA metrics capabilities available in the private beta for the new Value Streams Dashboard. We look forward to using other widgets as the Value Streams Dashboard matures, which we hope will greatly improve our productivity and efficiency.\"  \n> _**Rob Fulwell, Staff Engineer, Conversica**_\n\nThe first iteration of the GitLab Value Streams Dashboard enables teams to continuously improve software delivery workflows by benchmarking key DevOps metrics to help improve productivity, efficiency, scalability, and performance. Tracking and comparing these metrics over a period of time helps teams catch downward trends early, drill down into individual projects/metrics, take remedial actions to maintain their software delivery performance, and track progress of their innovation investments.\n\nLeadership can support developers by using information from the dashboard to cross-pollinate and promote best practices, add resources to projects based on metrics, and eliminate common bottlenecks across projects. \n\n\n\n### Roadmap for Value Streams Dashboard\n\nWe are just getting started with delivering capabilities in our Value Streams Dashboard. The roadmap includes planned features and functionality that will continue to improve decision-making and operational efficiencies.\n\nHere are some of the capabilities we plan to focus on next:\n\n1. New visualizations such as overview widgets, [top view treemap](https://gitlab.com/gitlab-org/gitlab/-/issues/381306), and [DORA performance score chart](https://gitlab.com/gitlab-org/gitlab/-/issues/386843)\n2. Security and vulnerability benchmarking  to enable executives to better understand an organization’s security exposure \n3. 
A new [data warehouse](https://gitlab.com/groups/gitlab-org/-/epics/9318?_gl=1*1orel9k*_ga*ODExMTUxMDcwLjE2Njk3MDM3Njk.*_ga_ENFH3X7M5Y*MTY3MjkxMTgxMC43Ny4xLjE2NzI5MTI0MTIuMC4wLjA.) that supports fast analytical queries and deep data analysis\n4. Additional business value metrics such as adoption, OKRs, revenue, costs, and CSAT that align technical and business goals\n\n[Learn more on our direction page](/direction/plan/value_stream_management/).\n\n### Join the beta: We welcome your contributions\n\nAs we iterate on this new offering, GitLab Premium and Ultimate customers are invited to [join our private beta](https://about.gitlab.com/value-streams-dashboard).\n\nWe also invite you to learn more about [Value Streams Dashboard](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html) and [follow along](https://gitlab.com/groups/gitlab-org/-/epics/9317) on the timeline to General Availability.\n\n## GitLab Remote Development\n\nThe increasing adoption of reproducible, ephemeral, cloud-based development environments has accelerated software development. But for developers, frequent context-switching between different environments, navigating complex and extensive toolchains, and managing a local development environment can create friction. GitLab Remote Development helps organizations better support developers by enabling them to spend less time managing their development environment and more time contributing high-quality code.\n\n> \"While a number of stakeholders are critical to successful DevOps, software developers are key for a successful DevOps implementation. Thus, organizations must adequately support developers. This means providing good developer experiences that are not disruptive or intrusive, but that are nonetheless sanctioned by the company, and that remain secure and compliant through automation and abstraction.\"  \n> _**Jay Lyman, 451 Research, a part of S&P Global Market Intelligence, \"Traditional IT teams, leadership stand out as additional DevOps stakeholders – Highlights from VotE: DevOps,\" January 4, 2023**_ \n\nThe centerpiece of GitLab Remote Development is our newly released Web IDE Beta, now the default web IDE experience on GitLab. The Web IDE makes it possible to securely connect to a remote development environment, run commands in an interactive terminal panel, and get real-time feedback from right inside the Web IDE. Understanding that developer familiarity is important, the Web IDE Beta uses a more powerful VS Code-based interface and is able to handle many of the most frequently performed tasks from the existing Web IDE, including committing changes to multiple files and reviewing merge request diffs.\n\nGitLab Remote Development also creates a more secure development experience by enabling organizations to implement a [zero-trust policy](/blog/why-devops-and-zero-trust-go-together/) that prevents source code and sensitive data from being stored locally across numerous developer devices. In addition, organizations can adhere to compliance requirements by ensuring developers are working with approved environments, libraries, and dependencies. \n\nIt’s interesting to note that we deployed the Web IDE Beta enabled by default, and 99.9% of users have kept it toggled on. I encourage you to learn more about the [new Web IDE functionality](/blog/get-ready-for-new-gitlab-web-ide/) in our recent blog post. 
\n\n### Roadmap for Remote Development\n\nAs iteration continues on the GitLab remote development experience, the roadmap focuses on the following functionality next: \n\n1. Provision instances of remote development environments on demand in the customer’s choice of cloud provider.\n2. Allow teams to share complex, multi-repo environments.\n3. Connect from a variety of IDEs, including VS Code, JetBrains, Vim, and the Web IDE.\n4. Ensure an organization’s remote environments conform to its software supply chain security requirements with advanced security tools, authorization, reports, and audit logs.\n\n[Learn more on our direction page](/direction/create/ide/remote_development/).\n\n## Engage with DevSecOps experts\n\nWant to dig deeper into how to innovate while still keeping an eye on cost efficiencies? Join me for our webcast “[GitLab’s DevSecOps Innovations and Predictions for 2023](https://page.gitlab.com/webcast-gitlab-devsecops-innovations-predictions-2023.html?utm_medium=blog&utm_source=gitlab&utm_campaign=devopsgtm&utm_content=fy23q4release)” on Jan. 31 to get expert advice and insights about this era of DevSecOps transformation and the tools and strategies you’ll need to meet this challenge. \n\n[Register today](https://page.gitlab.com/webcast-gitlab-devsecops-innovations-predictions-2023.html?utm_medium=blog&utm_source=gitlab&utm_campaign=devopsgtm&utm_content=fy23q4release)!\n\n**Disclaimer**: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. 
The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab.\n\n\n_Cover image by [Skye Studios](https://unsplash.com/@skyestudios?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com)_\n  \n",[707,758,9,823],{"slug":1729,"featured":6,"template":688},"the-gitlab-quarterly-how-our-latest-beta-releases-support-developers","content:en-us:blog:the-gitlab-quarterly-how-our-latest-beta-releases-support-developers.yml","The Gitlab Quarterly How Our Latest Beta Releases Support Developers","en-us/blog/the-gitlab-quarterly-how-our-latest-beta-releases-support-developers.yml","en-us/blog/the-gitlab-quarterly-how-our-latest-beta-releases-support-developers",{"_path":1735,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1736,"content":1742,"config":1748,"_id":1750,"_type":13,"title":1751,"_source":15,"_file":1752,"_stem":1753,"_extension":18},"/en-us/blog/the-importance-of-compliance-in-devops",{"title":1737,"description":1738,"ogTitle":1737,"ogDescription":1738,"noIndex":6,"ogImage":1739,"ogUrl":1740,"ogSiteName":672,"ogType":673,"canonicalUrls":1740,"schema":1741},"The importance of compliance in DevOps","A basic understanding of what compliance means and how it impacts DevOps.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749670037/Blog/Hero%20Images/auto-deploy-google-cloud.jpg","https://about.gitlab.com/blog/the-importance-of-compliance-in-devops","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The importance of compliance in DevOps\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Lauren Minning\"}],\n        \"datePublished\": \"2022-08-15\",\n      }",{"title":1737,"description":1738,"authors":1743,"heroImage":1739,"date":1745,"body":1746,"category":925,"tags":1747},[1744],"Lauren Minning","2022-08-15","\n\nDevOps teams must develop secure software, but a key part of security is compliance. Achieving compliance can be time-consuming, stressful, and resource intensive, but it’s increasingly a job DevOps teams – and developers specifically – are being asked to bake into their processes. \n\nHere’s a look at how compliance in DevOps works.\n\n## It starts with standards\n\nOrganizations of all sizes rely on nationally or internationally recognized standards to prove their security postures to customers, partners, and shareholders. Companies need to create systems that streamline compliance with a potentially large number of standards, such as [NIST](https://www.nist.gov), [ISO](https://www.iso.org/home.html), [SLSA levels](https://slsa.dev/spec/v0.1/index), [GDPR](https://gdpr-info.eu), [SOX](https://en.wikipedia.org/wiki/Sarbanes–Oxley_Act), [SOC2](https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/aicpasoc2report), [PCI DSS](https://www.pcisecuritystandards.org), [HIPAA](https://www.cdc.gov/phlp/publications/topic/hipaa.html), and [HITECH](https://www.hhs.gov/hipaa/for-professionals/special-topics/hitech-act-enforcement-interim-final-rule/index.html). At GitLab, we know exactly how difficult this is as we went through the [SOC 2 compliance process](/blog/benefits-of-transparency-in-compliance/) ourselves, as well as many other compliance initiatives.\n\nPreviously, tackling compliance requirements involved spreadsheets, checklists, and cross-functional teams of people digging for data. 
Being certified compliant was critical to a business, but not critical enough to codify and streamline the process... and that was before the advent of the cloud, where the data could literally be anywhere and everywhere.\n\n“It's incredibly difficult to know if you’ve done the right things to stay secure and compliant, especially in an increasingly complex environment of cloud-native applications, infrastructure-as-code, microservices, and more open source components,” explains Dave Steer, GitLab vice president of product and solutions marketing.\n\nThat's where automation, cooperation, and collaboration -- and DevOps -- come in.\n\n## Creating cohesion\n\nIt’s well known that developers and security pros have [struggled to find common ground](/blog/developer-security-divide/) around secure software development, and compliance is one step further down an already rocky path of cooperation. But embedding compliance in DevOps can happen with the right mix of culture and technology. To start, it’s important to decide which standards apply to your organization and whether compliance will be kept separate from security or integrated into the same team. Either way, security and compliance work together, one feeding into the other: Compliance sets the parameters for meeting regulatory requirements, and security executes the actions to meet those requirements. \n\nAnd that’s when the fun can really begin. The “beating heart” of DevOps is automation, and if ever there was a process crying out to be automated and built directly into DevOps, it’s compliance. There are three main ways DevOps teams can streamline the compliance process:\n\n- **Make compliance standards part of the CI/CD pipeline.** While this might not work for every compliance requirement, it eliminates the need for a manual checklist and provides a clear audit trail and a hard stop if there’s an issue, because the pipeline will fail (see the sketch after this list).\n\n- **Leverage containers.** When teams are certain a process or technology is compliant, it can be made into a container image. Over time, these “Golden Images,” as [Martin Fowler refers to them](https://martinfowler.com/articles/devops-compliance.html), can be assembled as guiding lights of compliance.\n\n- **Establish a system of record, or SOR.** An SOR will allow a DevOps team to track compliance just before a change is made to the code or the process.\n\n
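To make the first of these concrete, here is a minimal sketch of a compliance gate that could run as a CI/CD job. The report format, control names, and file name are hypothetical assumptions, not a GitLab feature; the point is simply that a non-zero exit code fails the job, which halts the pipeline and leaves an audit trail in the job log.\n\n```python\n#!/usr/bin/env python3\n# Hypothetical compliance gate: exits non-zero when a required control is\n# missing from a report produced by earlier pipeline jobs (the 'hard stop').\nimport json\nimport sys\n\nREQUIRED_CONTROLS = {'license_scan', 'dependency_scan', 'audit_trail'}\n\ndef main(report_path: str) -> int:\n    with open(report_path) as f:\n        report = json.load(f)  # e.g. written by earlier jobs in the pipeline\n    passed = {c['name'] for c in report.get('controls', []) if c.get('passed')}\n    missing = REQUIRED_CONTROLS - passed\n    if missing:\n        print(f'Compliance gate failed; missing controls: {sorted(missing)}')\n        return 1  # non-zero exit fails the CI job and stops the pipeline\n    print('All required compliance controls passed.')\n    return 0\n\nif __name__ == '__main__':\n    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else 'compliance-report.json'))\n```\n\n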
## Is your software supply chain secure?\n\nAs we continue to navigate an always-evolving modern DevOps environment, it’s important to be aware that compliance and security are coming together under one primary theme moving forward: software supply chain security.\n\n[Software supply chain security](/blog/gitlab-supply-chain-security/) is fast becoming the compliance and security umbrella, supported by security scanning, policy automation/guardrails, [securing the software factory itself](/blog/elite-team-strategies-to-secure-software-supply-chains/), and common controls embedded within the software factory. \n\nCombined with continuous maintenance of compliance and security regulations, automated DevOps practices have the potential to help discover security and compliance issues faster and address threats more quickly and effectively. \n\nIt's imperative that organizations understand how to comply with required regulations. Learn how GitLab helps organizations achieve [continuous compliance](/solutions/compliance/) and about our [software supply chain security direction](/direction/supply-chain/).\n",[707,925,9],{"slug":1749,"featured":6,"template":688},"the-importance-of-compliance-in-devops","content:en-us:blog:the-importance-of-compliance-in-devops.yml","The Importance Of Compliance In Devops","en-us/blog/the-importance-of-compliance-in-devops.yml","en-us/blog/the-importance-of-compliance-in-devops",{"_path":1755,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1756,"content":1762,"config":1768,"_id":1770,"_type":13,"title":1771,"_source":15,"_file":1772,"_stem":1773,"_extension":18},"/en-us/blog/the-road-to-gitaly-1-0",{"title":1757,"description":1758,"ogTitle":1757,"ogDescription":1758,"noIndex":6,"ogImage":1759,"ogUrl":1760,"ogSiteName":672,"ogType":673,"canonicalUrls":1760,"schema":1761},"GitLab no longer requires NFS: The road to Gitaly v1.0","How we went from vertical to horizontal scaling without depending on NFS by creating our own Git RPC service.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749670092/Blog/Hero%20Images/road-to-gitaly.jpg","https://about.gitlab.com/blog/the-road-to-gitaly-1-0","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The road to Gitaly v1.0 (aka, why GitLab doesn't require NFS for storing Git data anymore)\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Zeger-Jan van de Weg\"}],\n        \"datePublished\": \"2018-09-12\",\n      }",{"title":1763,"description":1758,"authors":1764,"heroImage":1759,"date":1765,"body":1766,"category":681,"tags":1767},"The road to Gitaly v1.0 (aka, why GitLab doesn't require NFS for storing Git data anymore)",[901],"2018-09-12","\nIn the early days of [GitLab.com](https://gitlab.com), most of the application,\nincluding Rails worker processes, Sidekiq background processes, and Git storage,\nran on a single server. A single server is easy to deploy to and maintain.\nThe same structure is what most smaller GitLab instances still use for their\nself-managed [Omnibus](https://docs.gitlab.com/omnibus/) installation. Scaling\nis done vertically, meaning adding more RAM, CPU, and disk space.\n\n## Moving from vertical to horizontal scaling\n\nSoon we ran out of options to continue scaling the system vertically, and we had\nto move to scaling horizontally by adding new servers. To have the repositories\navailable on all the nodes, NFS (Network File System) was used to mount them on each application\nserver and background worker. NFS is a well-known technology for sharing file\nsystems across a network. For each server, each storage node needed to be\nmounted. The advantage: GitLab.com could keep adding more servers and scale. However, NFS\nhad multiple disadvantages too: it reduced visibility into what type of file\nsystem operation was being performed. Even worse, an outage of a single NFS storage node\ntook the whole site down. At the same time, Git operations\ncan be quite CPU/IOPS intensive, so we began a balancing act between adding more nodes,\nand thus reducing reliability, versus scaling nodes vertically.\n\n## Considering NFS alternatives\n\nOver two years ago, we started to look for alternatives. One of the first ideas\nwas to remove the dependency on NFS with [Ceph](https://ceph.com/).\nCeph is a distributed file system that was meant to replace NFS in an\narchitecture like ours. 
Like NFS, this would solve our scaling problem on the\nsystem level, meaning that little to no changes would be required to GitLab as\nan application. However, a Ceph cluster running in the cloud didn't have the\nperformance characteristics we required. Briefly, we flirted with the idea\nof [moving away from the cloud][no-cloud], but this would have had major implications\nfor our own infrastructure team, and given that many of our customers _do_ run in\nthe cloud, [we decided to stay in the cloud][yes-cloud].\n\n[no-cloud]: /blog/why-choose-bare-metal/\n[yes-cloud]: /2017/03/02/why-we-are-not-leaving-the-cloud/\n\n## Introducing Gitaly\n\nSo it was clear that the application needed to be redesigned, and a new service\nwould be introduced to handle all Git requests. We named it\n[Gitaly](https://gitlab.com/gitlab-org/gitaly).\n\n![Gitaly Architecture Diagram](https://about.gitlab.com/images/gitaly_arch.png){: .large.center}\n*\u003Csmall>The planned architecture at the project start\u003C/small>*\n\nAs the diagram shows, the new Git server would have a number of distinct clients.\nTo make sure the protocol for the server and its clients is well defined,\n[Protocol Buffers][protobuf] was used. The client calls are handled by\nleveraging [gRPC][grpc]. Combined, they allowed us to iteratively add RPCs and\nmove away from NFS, in favor of an HTTP boundary. With the technologies chosen,\nthe migration started. The ultimate goal: v1.0, meaning no disk access was\nrequired to the Git storage nodes for [GitLab.com](https://gitlab.com).\n\nShipping such an architectural change should not affect the performance or\nthe stability of self-managed GitLab installations, so each RPC was gated behind a [feature\nflag](https://docs.gitlab.com/ee/development/feature_flags/index.html). Once an RPC had gone through a series of tests for both\ncorrectness and performance impact, the gate was removed. To determine stability, we used\n[Prometheus](https://docs.gitlab.com/ee/administration/monitoring/prometheus/) for monitoring and the ELK stack for sifting through massive numbers of structured log messages.\n\nThe server was written in Go, while the application is a large Rails monolith.\nThe Rails application had a great amount of code that was still very valuable. This code\nwas extracted to the `lib/gitlab/git` directory, allowing easier vendoring. The idea\nwas to start a sidecar next to the Go server, reusing the old code. About once a week the\ncode would be re-vendored. This allowed Ruby developers on other teams to\nwrite code once, and ship it. Bonus points could be earned if [the boilerplate code][gitaly-ruby]\nwas written to call the same function in Ruby!\n\n[protobuf]: https://developers.google.com/protocol-buffers/\n[gitaly-ruby]: https://gitlab.com/gitlab-org/gitaly/blob/232c26309a8e9bef61262ccd04a8f0ba75e13d73/doc/beginners_guide.md#gitaly-ruby-boilerplate\n[grpc]: https://grpc.io/\n\nThe new service wasn't all sunshine and rainbows, though; at times it felt like\nthe improved visibility was hurting our ability to ship. For example, it became\nclear that the illusion of an attached disk created\n[N + 1 problems][rails-eager-loading]. And even though this is a well-known problem\nin Ruby on Rails, the tools to combat it are all tailored toward\nActiveRecord, Rails' ORM.\n\n[rails-eager-loading]: https://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations\n\n
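The gating pattern itself is easy to picture. Here is a minimal sketch, in illustrative Python rather than Gitaly's actual Go and Ruby code; the flag store, the stub, and every name below are hypothetical stand-ins.\n\n```python\n# Hypothetical stand-ins: a feature-flag store and a Gitaly client stub.\n# In reality the new path is a gRPC call to the Gitaly server.\nFLAGS = {'gitaly_find_default_branch': True}\n\ndef feature_enabled(flag: str) -> bool:\n    return FLAGS.get(flag, False)\n\nclass GitalyStub:\n    def find_default_branch(self, repo_id: str) -> str:\n        return 'refs/heads/master'  # placeholder for the RPC response\n\ndef read_head_from_disk(repo_path: str) -> str:\n    return 'refs/heads/master'  # placeholder for reading over the NFS mount\n\ndef default_branch(repo_path: str, repo_id: str, stub: GitalyStub) -> str:\n    # Each RPC was gated behind a flag; once correctness and performance\n    # were verified, the gate (and eventually the disk path) was removed.\n    if feature_enabled('gitaly_find_default_branch'):\n        return stub.find_default_branch(repo_id)\n    return read_head_from_disk(repo_path)\n```\n\n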
## Nearing v1.0\n\nWith each RPC introduced, v1.0 was getting closer and closer. But how could we be\nsure everything was migrated before unmounting all NFS mount points? A trip\nswitch was introduced, guarding the details required to get to the full path of each\nrepository. Without this data there was no way to execute any Git operation\nthrough NFS. Luckily, the trip switch never went off, so it was clear NFS\nwasn't being used. The next step was unmounting on our staging environment! Again, this was very\nuneventful. After leaving the volumes unmounted for a full week and not seeing any\nindication of unexpected errors, the logical next step was our production instance.\n\nDays later we started rolling out these changes to production: first the\nbackground workers were unmounted, then we moved on to higher-impact services. At\nthe end of the day, all drives were unmounted without customer impact.\n\n## What's next?\n\nSo, where is this v1.0 tag? We didn't tag it, and I don't think we will. v1.0 is\na state for our Git infrastructure, and a goal for the team, rather than the code base.\nThat being said, the next goal is allowing all customers to run without NFS.\nAt the time of writing, some features, like administrative tasks, aren't using Gitaly just\nyet. These are slated for [v1.1][gitaly-11], our next objective.\n\nWant to know more about our Gitaly journey? Read about [how we're making your Git data highly available with Praefect](/blog/high-availability-git-storage-with-praefect/) and [how a fix in Go 1.9 sped up our Gitaly service by 30x](/blog/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/).\n{: .alert .alert-info .text-center}\n\n[gitaly-11]: https://gitlab.com/groups/gitlab-org/-/epics/288\n\nPhoto by [Jason Hafso](https://unsplash.com/photos/8Sjcc4vExpg) on Unsplash\n{: .note}\n",[754,757,9],{"slug":1769,"featured":6,"template":688},"the-road-to-gitaly-1-0","content:en-us:blog:the-road-to-gitaly-1-0.yml","The Road To Gitaly 1 0","en-us/blog/the-road-to-gitaly-1-0.yml","en-us/blog/the-road-to-gitaly-1-0",{"_path":1775,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1776,"content":1782,"config":1788,"_id":1790,"_type":13,"title":1791,"_source":15,"_file":1792,"_stem":1793,"_extension":18},"/en-us/blog/the-ultimate-guide-to-sboms",{"title":1777,"description":1778,"ogTitle":1777,"ogDescription":1778,"noIndex":6,"ogImage":1779,"ogUrl":1780,"ogSiteName":672,"ogType":673,"canonicalUrls":1780,"schema":1781},"The ultimate guide to SBOMs","Learn what a software bill of materials is and why it has become an integral part of modern software development.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749664571/Blog/Hero%20Images/blog-image-template-1800x945__8_.png","https://about.gitlab.com/blog/the-ultimate-guide-to-sboms","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The ultimate guide to SBOMs\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sandra Gittlen\"}],\n        \"datePublished\": \"2022-10-25\",\n      }",{"title":1777,"description":1778,"authors":1783,"heroImage":1779,"date":1784,"body":1785,"category":925,"tags":1786,"updatedDate":1787},[727],"2022-10-25","In today's rapidly evolving digital landscape, the emphasis on application security within the software supply chain has never been more critical. The integration of upstream dependencies into software requires transparency and security measures that can be complex to implement and manage. 
This is where a software bill of materials (SBOM) becomes indispensable.\n\nServing as a comprehensive list of ingredients that make up software components, an SBOM illuminates the intricate web of libraries, tools, and processes used across the development lifecycle. Coupled with vulnerability management tools, an SBOM not only reveals potential vulnerabilities in software products but also paves the way for strategic risk mitigation. Our guide dives deep into SBOMs, their pivotal role in a multifaceted [DevSecOps](/topics/devsecops/) strategy, and strategies for improving your application's SBOM health — all aimed at fortifying your organization's cybersecurity posture in a landscape full of emerging threats.\n\nYou'll learn:\n- [What is an SBOM?](#what-is-an-sbom%3F)\n- [Why SBOMs are important](#why-sboms-are-important)\n- [Types of SBOM data exchange standards](#types-of-sbom-data-exchange-standards)\n- [Benefits of pairing SBOMs and software vulnerability management](#benefits-of-pairing-sboms-and-software-vulnerability-management)\n- [GitLab and dynamic SBOMs](#gitlab-and-dynamic-sboms)\n    - [Scale SBOM generation and management](#scale-sbom-generation-and-management)\n    - [Ingest and merge SBOMs](#ingest-and-merge-sboms)\n    - [Accelerate mitigation for better SBOM health](#accelerate-mitigation-for-better-sbom-health)\n    - [Continuous SBOM analysis](#continuous-sbom-analysis)\n    - [Building trust in SBOMs](#building-trust-in-sboms)\n - [The future of GitLab SBOM functionality](#the-future-of-gitlab-sbom-functionality)\n - [Get started with SBOMs](#get-started-with-sboms)\n - [SBOM FAQ](#sbom-faq)\n\n## What is an SBOM?\n\nAn SBOM is a nested inventory or [list of ingredients that make up software components](https://www.cisa.gov/sbom#). In addition to the components themselves, SBOMs include critical information about the libraries, tools, and processes used to develop, build, and deploy a software artifact.\n\nThe SBOM concept has existed [for more than a decade](https://spdx.dev/about/). However, as part of an effort to implement the National Cyber Strategy that the White House released in 2023, [CISA’s Secure by Design framework](https://www.cisa.gov/securebydesign) is helping guide software manufacturers  to adopt secure-by-design principles and integrate cybersecurity into their products. The U.S. government [issued best practices](/blog/comply-with-nist-secure-supply-chain-framework-with-gitlab/) that are driving application developers selling to the public sector to include SBOMs with their software packages. The private sector is not far behind, sending SBOMs on the path to ubiquity. \n\nAlthough SBOMs are often created with stand-alone software, platform companies like GitLab are integrating SBOM generation early and deep in the DevSecOps workflow.\n\n![supply chain security sdlc](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749673653/Blog/Content%20Images/supply_chain_security_sdlc.png)\n\n## Why SBOMs are important\n\nModern software development is laser-focused on delivering applications at a faster pace and in a more efficient manner. This can lead to developers incorporating code from open source repositories or proprietary packages into their applications.  
According to Synopsys’s 2024 Open Source Security and Risk Analysis report, which consolidated findings from more than 1,000 commercial codebases across 17 industries in 2023, 96% of the total codebases contained open source and 84% of codebases assessed for risk contained vulnerabilities.\n\nPulling in code from unknown repositories increases the potential for vulnerabilities that can be exploited by hackers. In fact, the [2020 SolarWinds attack](https://www.techtarget.com/whatis/feature/SolarWinds-hack-explained-Everything-you-need-to-know) was sparked by the activation of malicious code injected into a package used by SolarWinds’ Orion product. Customers across the software supply chain were significantly impacted. Other attacks, including the log4j vulnerability that impacted a number of commercial software vendors, cemented the need for a deep dive into application dependencies, including containers and infrastructure, to be able to assess [risk throughout the software supply chain](https://about.gitlab.com/blog/the-ultimate-guide-to-software-supply-chain-security/).\n\nThere is also a cost component to finding and remediating a software security vulnerability, as well as the damage to a company’s reputation that a software supply chain attack can incur; both further raise the need for SBOMs. SBOMs give you insight into your dependencies and can be used to look for vulnerabilities and for licenses that don’t comply with internal policies.\n\n## Types of SBOM data exchange standards\n\nSBOMs work best when the generation and interpretation of information such as name, version, and packager can be automated, which in turn works best when all parties use a standard data exchange format.\n\nThere are two main types of SBOM data exchange standards in use today:\n- [OWASP CycloneDX](https://cyclonedx.org/capabilities/sbom/)\n- [SPDX](https://spdx.dev/)\n\nGitLab uses CycloneDX for its SBOM generation because the standard is prescriptive and user-friendly, can simplify complex relationships, and is extensible to support specialized and future use cases. In addition, [cyclonedx-cli](https://github.com/CycloneDX/cyclonedx-cli#convert-command) and [cdx2spdx](https://github.com/spdx/cdx2spdx) are open source tools that can be used to convert CycloneDX files to SPDX if necessary.\n\n## Benefits of pairing SBOMs and software vulnerability management\n\nSBOMs are highly beneficial for DevSecOps teams and software consumers for several reasons:\n* They enable a standard approach to understanding what additional software components are in an application and where they are declared.\n* They provide ongoing visibility into the history of an application’s creation, including details about third-party code origins and host repositories.\n* They provide a deep level of security transparency into both first-party developed code and adopted open source software.\n* The details that SBOMs offer enable a DevOps team to identify vulnerabilities, assess the potential risks, and then mitigate them. \n* SBOMs can deliver the transparency that application purchasers now demand.\n\n
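As a small illustration of how such an inventory gets used, here is a minimal sketch that reads a CycloneDX JSON SBOM and flags components whose licenses violate an internal policy. The `components[].licenses[].license.id` fields follow the CycloneDX specification, but the file name and the disallowed-license list are illustrative assumptions.\n\n```python\nimport json\n\n# Illustrative policy: SPDX license IDs an organization might disallow.\nDISALLOWED_LICENSES = {'AGPL-3.0-only', 'SSPL-1.0'}\n\ndef license_violations(sbom_path: str) -> list[str]:\n    with open(sbom_path) as f:\n        sbom = json.load(f)  # a CycloneDX JSON document\n    violations = []\n    for comp in sbom.get('components', []):\n        name, version = comp.get('name'), comp.get('version')\n        for entry in comp.get('licenses', []):\n            spdx_id = entry.get('license', {}).get('id')\n            if spdx_id in DISALLOWED_LICENSES:\n                violations.append(f'{name}@{version}: {spdx_id}')\n    return violations\n\nprint(license_violations('sbom.cdx.json'))  # file name is illustrative\n```\n\n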
## GitLab and dynamic SBOMs\n\nFor SBOMs to be fully impactful, organizations must be able to automatically generate them, connect them with application security scanning tools, integrate the vulnerabilities and licenses into a dashboard for easy comprehension and actionability, and update them continuously. GitLab supports all of these goals.\n\n![Dynamic SBOM management](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749673653/Blog/Content%20Images/Screenshot_2024-05-03_at_10.53.28_AM.png)\n\n### Scale SBOM generation and management\nTo comply with internal policies and regulations, it is key to have accurate and comprehensive SBOMs that cover open source, third-party, and proprietary software. To effectively manage SBOMs for each component and product version, a streamlined process is required for creating, merging, validating, and approving SBOMs. GitLab’s [Dependency List feature](https://docs.gitlab.com/ee/user/application_security/dependency_list/) aggregates known vulnerability and license data into a single view within the GitLab user interface. Dependency graph information is also generated as part of the dependency scanning report. This empowers users to gain comprehensive insights into dependencies and risk within their projects or across groups of projects. Additionally, a JSON CycloneDX-formatted artifact can be produced in the CI pipeline, and the GitLab API offers a more nuanced and customizable approach to SBOM generation. SBOMs are exportable from the UI, from a specific pipeline or project, or via the API. \n\n### Ingest and merge SBOMs\nGitLab can ingest third-party SBOMs, providing a deep level of security transparency into both third-party developed code and adopted open source software. With GitLab, you can use a [CI/CD](https://about.gitlab.com/topics/ci-cd/) job to seamlessly merge multiple CycloneDX SBOMs into a single SBOM. Using implementation-specific details in the CycloneDX metadata of each SBOM, such as the location of build and lock files, duplicate information is removed from the resulting merged file. This data is also augmented automatically with license and vulnerability information for the components inside the SBOM.\n\n
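The core of that merge step can be pictured as de-duplication keyed on each component's package URL (purl). The sketch below shows only the idea, not GitLab's actual job, which also reconciles metadata such as build and lock file locations; the input file names are illustrative.\n\n```python\nimport json\n\n# Minimal sketch: merge CycloneDX SBOMs, dropping duplicate components.\ndef merge_sboms(paths: list[str]) -> dict:\n    seen, merged = set(), []\n    for path in paths:\n        with open(path) as f:\n            for comp in json.load(f).get('components', []):\n                # Prefer the purl as the identity; fall back to name/version.\n                key = comp.get('purl') or (comp.get('name'), comp.get('version'))\n                if key not in seen:\n                    seen.add(key)\n                    merged.append(comp)\n    return {'bomFormat': 'CycloneDX', 'specVersion': '1.4',\n            'version': 1, 'components': merged}\n\ncombined = merge_sboms(['sbom-npm.cdx.json', 'sbom-gem.cdx.json'])\n```\n\n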
### Accelerate mitigation for better SBOM health\nBuilding high-quality products faster requires actionable security findings so developers can address the most critical weaknesses. GitLab helps secure your supply chain by [scanning for vulnerabilities](https://docs.gitlab.com/ee/user/application_security/secure_your_application.html) in source code, containers, dependencies, and running applications. GitLab offers full security scanner coverage from Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), container scanning, and software composition analysis (SCA) features to help you achieve full coverage against emerging threat vectors.\nTo help developers and security engineers better understand and remediate vulnerabilities more efficiently, [GitLab Duo](https://about.gitlab.com/gitlab-duo/) Vulnerability Explanation, an AI-powered feature, provides an explanation of a specific vulnerability, how it can be exploited, and, most importantly, a recommendation on how to fix it. When combined with GitLab Duo Vulnerability Resolution, DevSecOps teams can intelligently identify, analyze, and fix vulnerabilities in just a matter of clicks.\n\nThe platform also supports creation of new policies (and [compliance enforcement](https://docs.gitlab.com/ee/administration/compliance.html)) based on newly detected vulnerabilities. \n\n### Continuous SBOM analysis \nGitLab Continuous Vulnerability Scanning triggers a scan on all projects where container scanning, dependency scanning, or both are enabled, independent of a pipeline. When new Common Vulnerabilities and Exposures (CVEs) are reported to the National Vulnerability Database (NVD), users don’t need to re-run their pipelines to get the latest feeds. GitLab’s Vulnerability Research Team adds them to GitLab’s Advisory Database, and those advisories are automatically surfaced in GitLab as vulnerabilities. This makes GitLab’s SBOM truly dynamic.\n\n
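Conceptually, this is advisory matching against stored SBOM components instead of a pipeline re-run. Here is a minimal sketch with a hypothetical advisory structure (not GitLab's Advisory Database schema), using the log4j example from earlier.\n\n```python\n# Illustrative only: match a newly published advisory against the\n# components recorded in a stored SBOM, without re-running any pipeline.\ndef affected_components(components: list[dict], advisory: dict) -> list[dict]:\n    return [c for c in components\n            if c['name'] == advisory['package']\n            and c['version'] in advisory['affected_versions']]\n\nsbom_components = [{'name': 'log4j-core', 'version': '2.14.1'}]\nadvisory = {'cve': 'CVE-2021-44228', 'package': 'log4j-core',\n            'affected_versions': {'2.14.0', '2.14.1'}}\nprint(affected_components(sbom_components, advisory))  # -> the vulnerable component\n```\n\n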
### Building trust in SBOMs\nOrganizations that require [compliance functionality](https://about.gitlab.com/solutions/compliance/) can use GitLab to [generate attestation for all build artifacts](/blog/securing-the-software-supply-chain-through-automated-attestation/) produced by the GitLab Runner. The process is secure because it is produced by the GitLab Runner itself, with no handoff of data to an external service.\n\n## The future of GitLab SBOM functionality\n\nSoftware supply chain security continues to be a critical topic in the cybersecurity and software industry due to frequent attacks on large software vendors and the focused efforts of attackers on the open source software ecosystem. And although the SBOM industry is evolving quickly, there are still concerns around how SBOMs are generated, the frequency of that generation, where they are stored, how to combine multiple SBOMs for complex applications, how to analyze them, and how to leverage them for application health.\n\nGitLab has made SBOMs an integral part of its [software supply chain direction](https://about.gitlab.com/direction/supply-chain/) and continues to improve upon its SBOM capabilities within the DevSecOps platform, including planning new features and functionality. Recent enhancements to SBOM capabilities include the automation of attestation, digital signing for build artifacts, and support for externally generated SBOMs.\n\nGitLab has also established a robust [SBOM Maturity Model](https://handbook.gitlab.com/handbook/security/security-assurance/dedicated-compliance/sbom-plan/) within the platform that involves steps such as automatic SBOM generation, sourcing SBOMs from the development environment, analyzing SBOMs for artifacts, and advocating for the digital signing of SBOMs. GitLab also plans to add automatic digital signing of build artifacts in future releases. \n\n## Get started with SBOMs\n\nThe demand for SBOMs is already high. Government agencies increasingly recommend or require SBOM creation for software vendors, federal software developers, and even open source communities.\n\n> To get ahead of this requirement, check out the SBOM capabilities for GitLab Ultimate in [GitLab’s DevSecOps platform](https://gitlab.com/-/trials/new).\n\n## SBOM FAQ\n\n**What is an SBOM?**\n\nAn SBOM is a detailed inventory that lists all components, libraries, and tools used in creating, building, and deploying software. This comprehensive list goes beyond mere listings to include vital information about code origins, thus promoting a deeper understanding of an application's makeup and potential vulnerabilities.\n\n**Why are SBOMs important?**\n\nSBOMs are crucial for several reasons. They provide:\n- Insight into dependencies: Understanding what makes up your software helps identify and mitigate risks associated with third-party components.\n- Enhanced security: With detailed visibility into application components, organizations can pinpoint vulnerabilities quickly and take steps to address them.\n- Regulatory compliance: Increasingly, regulations and best practices recommend or require an SBOM for software packages, particularly for those in the public sector.\n- Streamlined development: Developers can lean on an SBOM for insights into used libraries and components, saving time and reducing errors in the development cycle.\n\n**What standards are used for SBOM data exchange?**\n\nThere are two predominant standards:\n- CycloneDX: Known for its user-friendly approach, CycloneDX simplifies complex relationships between software components and supports specialized use cases.\n- SPDX: Another widely used framework for SBOM data exchange, providing detailed information about components within the software environment.\n\nGitLab specifically employs CycloneDX for its SBOM generation because of its prescriptive nature and extensibility to future needs.\n\n**What is GitLab’s approach to SBOMs?**\n\nGitLab emphasizes the creation of dynamic SBOMs that can be:\n- Automatically generated: Ensuring up-to-date information on software composition.\n- Integrated with tools: Connecting to vulnerability scanning tools for thorough risk assessment.\n- Easily managed: Supporting ingestion and merging of SBOMs for comprehensive analysis.\n- Continuously analyzed: Offering ongoing scanning of projects to detect new vulnerabilities as they emerge.\n\n**How can I start implementing SBOMs in my organization?**\n\nFor organizations ready to adopt SBOMs, GitLab’s Ultimate package provides a robust platform for generating and managing SBOMs within a DevSecOps workflow. By leveraging GitLab’s tools, teams can ensure compliance, enhance security, and optimize development practices.\n\nThe increasing demand for SBOMs reflects the growing emphasis on software security and supply chain integrity. By integrating SBOM capabilities, organizations can better protect themselves against vulnerabilities and comply with emerging regulations.\n\n> [Try GitLab Ultimate free for 30 days today.](https://about.gitlab.com/free-trial/devsecops/)\n\n_Disclaimer: This blog contains information related to upcoming products, features, and functionality. It is important to note that the information in this blog post is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned in this blog and linked pages are subject to change or delay. 
The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab._",[925,758,9,708,183],"2024-05-02",{"slug":1789,"featured":6,"template":688},"the-ultimate-guide-to-sboms","content:en-us:blog:the-ultimate-guide-to-sboms.yml","The Ultimate Guide To Sboms","en-us/blog/the-ultimate-guide-to-sboms.yml","en-us/blog/the-ultimate-guide-to-sboms",{"_path":1795,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1796,"content":1802,"config":1807,"_id":1809,"_type":13,"title":1810,"_source":15,"_file":1811,"_stem":1812,"_extension":18},"/en-us/blog/three-steps-to-optimize-software-value-streams",{"title":1797,"description":1798,"ogTitle":1797,"ogDescription":1798,"noIndex":6,"ogImage":1799,"ogUrl":1800,"ogSiteName":672,"ogType":673,"canonicalUrls":1800,"schema":1801},"GitLab's 3 steps to optimizing software value streams","Discover the power of GitLab Value Streams Dashboard (VSD) for optimizing software delivery workflows.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667893/Blog/Hero%20Images/workflow.jpg","https://about.gitlab.com/blog/three-steps-to-optimize-software-value-streams","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"GitLab's 3 steps to optimizing software value streams\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2023-06-26\",\n      }",{"title":1797,"description":1798,"authors":1803,"heroImage":1799,"date":1804,"body":1805,"category":730,"tags":1806},[818],"2023-06-26","\n\n\u003Ci>This is part three of our multipart series introducing you to the capabilities within GitLab Value Stream Management and the Value Streams Dashboard. In part one, [learn about the Total Time Chart](https://about.gitlab.com/blog/value-stream-total-time-chart/) and how to simplify top-down optimization flow with Value Stream Management. In part two, learn how to [get started with the Value Streams Dashboard](https://about.gitlab.com/blog/getting-started-with-value-streams-dashboard/). \u003C/i>\n\nIt’s no news that software development is a complex process that involves many different stages, teams, and tools. With significant investments made in digital transformation and adopting new tools following the shift to remote work, measuring and managing the business value of the software development lifecycle (SDLC) have become more complex.\n\nThis is where Value Stream Management (VSM) comes in. VSM is a methodology that helps organizations optimize their software delivery process by visualizing, measuring, and improving the flow of value (a.k.a. the “value stream”) from ideation to production. Some examples are: the amount of time it takes to go from an idea to production, the velocity of the project, bottlenecks in the development process, and long-running issues or merge requests. As you’ve probably guessed from its title, this blog will cover how the [new capabilities of GitLab Value Streams Dashboard](https://about.gitlab.com/releases/2023/05/22/gitlab-16-0-released/#value-streams-dashboard-is-now-generally-available) can help you do all that, and optimize your software delivery.\n\n## Value Stream Management in a nutshell \nGitLab [VSM](https://about.gitlab.com/solutions/value-stream-management/) provides end-to-end visibility into your software delivery process. 
It enables you to [map out your value stream](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#create-a-value-stream-with-custom-stages), identify bottlenecks, measure key metrics, and pinpoint the places where you are either lagging or doing exceptionally well. It also allows you to take action on these insights. In essence, GitLab VSM helps you understand and optimize your development processes to deliver software faster and better.\n\n![GitLab Value Stream Analytics](https://about.gitlab.com/images/blogimages/2023-05-24-vsm-overview.png){: .shadow}\nWith Value Stream Analytics, you can establish a baseline for measuring progress in software delivery performance and identifying the touchpoints in the process that do not add value to the customer or your business.\n{: .note.text-center}\n\nAnd if you’re wondering how GitLab VSM is able to do that, it’s because GitLab provides an entire DevSecOps platform as a single application and, therefore, holds all the data needed to provide end-to-end visibility throughout the entire SDLC. So now, your decisions rely on actual data rather than blind estimation or gut feelings. Additionally, since GitLab is the place where work happens, these insights are also actionable, allowing your users to move from “understanding” to “fixing” at any time, from within their workflow and without losing context.\n\n## How VSM works: The three-step analysis\nLet’s take a look at how GitLab VSM helps you optimize your SDLC in three easy steps:\n\n**Step 1:** Get an end-to-end view across your entire organization and pinpoint the value streams you need to focus on.\n\nThe [Value Streams Dashboard](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html) is a centralized view where you can see and compare all of the SDLC metrics of all your organization's projects. This dashboard enables you to identify hotspots in your SDLC streams — projects or teams that are underperforming, with longer stages and cycle times. It also shows you where you have the largest value contributors, so you can identify and learn what is working well and what's not. With this information at hand, you can now prioritize your efforts and understand where to spend your time.\n\n![VSM illustration](https://about.gitlab.com/images/blogimages/2023-05-24_vsm1.gif){: .shadow}\n\n\nThis centralized UI acts as a single source of truth for your organization, where all the relevant stakeholders can access, view, and analyze the same set of metrics. This ensures everyone is on the same page, promoting consistency in analysis and decision-making.\n\nRead more: [Getting started with the new GitLab Value Streams Dashboard](https://about.gitlab.com/blog/getting-started-with-value-streams-dashboard/)\n\n**Step 2:** Drill down into a specific project.\n\nWhen you select a project from the main dashboard, you are directed to that project's Value Stream Analytics (VSA), where you see its value stream. The project's metrics are presented for each stage of the project, helping you understand where the main work lies and which stages need improvement. 
The VSA overview provides valuable insights into lead times, cycle times, and other critical metrics that help you identify areas for optimization.\n\n![VSM illustration](https://about.gitlab.com/images/blogimages/2023-05-24_vsm2.gif){: .shadow}\n\n\nRead more: [Value stream management: Total Time Chart simplifies top-down optimization flow](https://about.gitlab.com/blog/value-stream-total-time-chart/)\n\n**Step 3:** Dive deep into the Value Stream Analytics dashboard to analyze and fix issues.\n\nOnce the main areas of interest are identified, GitLab Value Stream Analytics (VSA) enables you to drill down further into a specific stage of the project. In the stage table, you can sort the **Last event** column to view the most recent workflow event, and sort the items by **duration** so you can rearrange the events and gain insights faster. This way, you can easily detect work items that are slowing down the project in that stage. Here's an example of how we dogfood [VSA on gitlab-org](https://gitlab.com/gitlab-org/gitlab/-/value_stream_analytics). \n\nYou can identify the owner of the work items responsible for the delays, examine code changes, and perform a comprehensive analysis of the issue. This level of visibility and traceability empowers you to take targeted actions and make the necessary improvements to optimize the value stream, all within the context of your current workflow.\n\n![VSM illustration](https://about.gitlab.com/images/blogimages/2023-05-24_vsm3.gif){: .shadow}\nUse GitLab Value Stream Management to visualize the progress of work from planning to value delivery, and gain actionable context.\n{: .note.text-center}\n\n## The value of Value Stream Management\nGitLab VSM is a powerful solution that fits seamlessly into your SDLC. By providing end-to-end visibility and granular, actionable insights into the value stream, VSM enables you to optimize your software delivery and provide value to your customers faster. Access the information you need, when you need it — and easily act on it from within your workflow. VSM offers you the best of both worlds: out-of-the-box functionality and the ability to customize features.\n\nSay goodbye to time-consuming searches and hello to instant access to the information you need most. 
To learn more, check out the [Value Stream Analytics documentation](https://docs.gitlab.com/ee/user/analytics/value_streams_dashboard.html).\n\nTo help us improve Value Stream Management, please share feedback about your experience in this [survey](https://gitlab.fra1.qualtrics.com/jfe/form/SV_50guMGNU2HhLeT4).\n",[843,707,823,9,732],{"slug":1808,"featured":6,"template":688},"three-steps-to-optimize-software-value-streams","content:en-us:blog:three-steps-to-optimize-software-value-streams.yml","Three Steps To Optimize Software Value Streams","en-us/blog/three-steps-to-optimize-software-value-streams.yml","en-us/blog/three-steps-to-optimize-software-value-streams",{"_path":1814,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1815,"content":1821,"config":1826,"_id":1828,"_type":13,"title":1829,"_source":15,"_file":1830,"_stem":1831,"_extension":18},"/en-us/blog/three-teams-left-jenkins-heres-why",{"title":1816,"description":1817,"ogTitle":1816,"ogDescription":1817,"noIndex":6,"ogImage":1818,"ogUrl":1819,"ogSiteName":672,"ogType":673,"canonicalUrls":1819,"schema":1820},"3 Teams left Jenkins: Here’s why","How three different teams – Alteryx, ANWB, and EAB – shifted away from Jenkins for smoother sailing with GitLab.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749671932/Blog/Hero%20Images/jenkins-to-gitlab-sailboat.jpg","https://about.gitlab.com/blog/three-teams-left-jenkins-heres-why","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"3 Teams left Jenkins: Here’s why\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Brein Matturro\"}],\n        \"datePublished\": \"2019-07-23\",\n      }",{"title":1816,"description":1817,"authors":1822,"heroImage":1818,"date":1823,"body":1824,"category":1104,"tags":1825},[703],"2019-07-23","\nAs many companies know, continuous integration and build processes are challenging. Complex tool\nintegrations, pieced-together pipelines, and overall system breakdowns are time-consuming for\neven the most experienced teams. The longer it takes for system recovery, the more costly it\nbecomes, creating more risk for the organization as a whole. Competitive companies are always on\nthe lookout for better solutions, and they're increasingly turning to GitLab to do just that.\n\nThree companies – Alteryx, ANWB, and EAB – all experienced unique challenges with Jenkins.\nWe highlight how each of these teams made the successful move to\n[GitLab from Jenkins](/solutions/jenkins/). Learn how each team\naccelerated deployment, improved CI/CD pipelines, created developer transparency, and\nalleviated toolchain stressors after making the switch to GitLab.\n\n## Alteryx: Builds down from 3 hours to 30 minutes\n\nAlteryx, a prominent end-to-end analytics platform, was using a legacy system with Jenkins\nthat was older, clunky, and difficult to manage. The team was looking to modernize their architecture\nand to improve their overall software development lifecycle.\n\nThey turned to GitLab because it offers many solutions in one tool. With GitLab, the Alteryx team is now\ncapable of managing source code, CI/CD, code reviews, and security scanning all in one place.\nA build that took three hours with Jenkins is now just 30 minutes in GitLab.\n\nAs Alteryx continues to grow in the analytics space, GitLab will continue to add new features\nto support the company's expanding needs. 
Learn more about [Alteryx’s journey](/customers/alteryx/).\n\n## ANWB: Increased deployments\n\nWith over 4.4 million members, ANWB offers services for credit cards, bicycle maintenance,\ncar sales, and travel throughout the Netherlands. Both the mobile and web development\nteams have their hands full with popular offerings like mapping and driver intelligence services.\n\nANWB was struggling with an outdated toolchain that included Jenkins version 1 as a build server.\nThe company wanted to speed up development, eliminate isolated and outdated processes, and give\nits teams autonomy.\n\nWith GitLab, ANWB can now manage separate teams, increase deployments, and support a culture\nwhere everyone contributes freely to colleagues' code repositories. ANWB has plans to move toward a\ncloud-centric framework and GitLab has helped to pave that road. Learn more about [ANWB’s path to success](/customers/anwb/).\n\n## EAB: \"Quality first\" culture\n\nServing over 1,500 schools, colleges, and universities, EAB uses data analytics and transformative\nmeasures to help students stay enrolled in education. The EAB team had to rely on several tools,\nincluding Jenkins, which made continuous integration overly complex and time consuming.\nDevelopers wanted to consolidate their various tools to create faster builds with much less maintenance.\n\nEAB initially turned to GitLab because of our regular feature releases and [tiered (and affordable) pricing](/pricing/).\nThe EAB development team soon realized they could have a steady pace of\nbuild releases without having to use multiple tools to make it happen. In just six months, workflow increased\nand the company plans to continue to roll out a \"quality first\" culture using GitLab as a guide.\n\n\u003Ci class=\"fab fa-gitlab\" style=\"color:rgb(107,79,187); font-size:.85em\" aria-hidden=\"true\">\u003C/i>&nbsp;&nbsp;\nWatch the [Migrating from Jenkins to GitLab](https://www.youtube.com/watch?v=RlEVGOpYF5Y) demo\n&nbsp;&nbsp;\u003Ci class=\"fab fa-gitlab\" style=\"color:rgb(107,79,187); font-size:.85em\" aria-hidden=\"true\">\u003C/i>\n{: .alert .alert-webcast}\n\nCover image by [Fab Lentz](https://unsplash.com/@fossy) on [Unsplash](https://unsplash.com)\n{: .note}\n",[1509,9,108],{"slug":1827,"featured":6,"template":688},"three-teams-left-jenkins-heres-why","content:en-us:blog:three-teams-left-jenkins-heres-why.yml","Three Teams Left Jenkins Heres Why","en-us/blog/three-teams-left-jenkins-heres-why.yml","en-us/blog/three-teams-left-jenkins-heres-why",{"_path":1833,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1834,"content":1840,"config":1846,"_id":1848,"_type":13,"title":1849,"_source":15,"_file":1850,"_stem":1851,"_extension":18},"/en-us/blog/tips-for-managing-monorepos-in-gitlab",{"title":1835,"description":1836,"ogTitle":1835,"ogDescription":1836,"noIndex":6,"ogImage":1837,"ogUrl":1838,"ogSiteName":672,"ogType":673,"canonicalUrls":1838,"schema":1839},"5 Tips for managing monorepos in GitLab","Learn the benefits of operating a monolithic repository and how to get the most out of this structure.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667591/Blog/Hero%20Images/code-review-blog.jpg","https://about.gitlab.com/blog/tips-for-managing-monorepos-in-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"5 Tips for managing monorepos in GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sarah Waldner\"}],\n        \"datePublished\": 
\"2022-07-12\",\n      }",{"title":1835,"description":1836,"authors":1841,"heroImage":1837,"date":1843,"body":1844,"category":681,"tags":1845},[1842],"Sarah Waldner","2022-07-12","\nGitLab was founded 10 years ago on Git because it is the market leading version control system. As [Marc Andressen pointed out in 2011](https://www.wsj.com/articles/SB10001424053111903480904576512250915629460), we see teams and code bases expanding at incredible rates, testing the limits of Git. Organizations are experiencing significant slowdowns in performance and added administration complexity working on enormous repositories or monolithic repositories. \n\n## Why do organizations develop on monorepos? \n\nGreat question. While [some](https://www.infoworld.com/article/3638860/the-case-against-monorepos.html) might believe that monorepos are a no-no, there are valid reasons why companies, including  Google or GitLab (that’s right! We operate a monolithic repository), choose to do so. The main benefits are: \n\n- Monorepos can reduce silos between teams, streamlining collaboration on design, development, and operation of different services because everything is within the same repository.\n- Monorepos help organizations standardize on tooling and processes. If a company is pursuing a DevOps transformation, a monorepo can help accelerate change management when it comes to new workflows or the rollout of new tools.\n- Monorepos simplify dependency management because all packages can be updated in a single commit.\n- Monorepos offer unified CI/CD and build processes. Having all services in a single repository means that you can set up one system of pipelines for everyone.\n\nWhile we still have a ways to go before monorepos or monolithic repositories are as easy to manage as multi-repos in GitLab, we put together five tips and tricks to maintain velocity while developing on a monorepo in GitLab.\n\n**1. Use CODEOWNERS to streamline merge request approvals**\n\nCODEOWNERS files live in the repository and assign an owner to a portion of the code, making it super efficient to process changes. Investing time in setting up a robust [CODEOWNERS file](https://docs.gitlab.com/ee/user/project/codeowners/) that you can then use to automate [merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/) from required people will save time down the road for developers. \n\nYou can then set your merge requests so they must be approved by Code Owners before merge. CODEOWNERS specified for the changed files in the merge request will be automatically notified.\n\n**2. Improve git operation performance with Git LFS**\n\nA universal truth of git is that managing large files is challenging. If you work in the gaming industry, I am sure you’ve been through the annoying process of trying to remove a binary file from the repository history after a well-meaning coworker committed it. This is where [Git LFS](https://docs.gitlab.com/ee/topics/git/lfs/#git-large-file-storage-lfs) comes in. Git LFS keeps all the big files in a different location so that they do not exponentially increase the size of a repository.\n\nThe GitLab server communicates with the Git LFS client over HTTPS. You can enable Git LFS for a project by toggling it in [project settings](https://docs.gitlab.com/ee/user/project/settings/index.html#configure-project-visibility-features-and-permissions). All files in Git LFS can be tracked in the GitLab interface. GitLab indicates what files are stored there with the LFS icon.\n\n**3. 
**3. Reduce download time with partial clone operations**\n\n[Partial clone](https://docs.gitlab.com/ee/topics/git/partial_clone.html#partial-clone) is a performance optimization that allows Git to function without having a complete copy of the repository. The goal of this work is to allow Git to better handle extremely large repositories.\n\nAs we just talked about, storing large binary files in Git is normally discouraged, because every large file added is downloaded by everyone who clones or fetches changes thereafter. These downloads are slow and problematic, especially when working from a slow or unreliable internet connection.\n\nUsing partial clone with a file size filter solves this problem by excluding troublesome large files from clones and fetches.\n\n**4. Take advantage of parent-child pipelines**\n\n[Parent-child pipelines](https://docs.gitlab.com/ee/ci/pipelines/downstream_pipelines.html) are where one pipeline triggers a set of downstream pipelines in the same project. The downstream pipelines execute independently, without waiting for other pipelines to finish. Additionally, each child pipeline contains only the configuration relevant to it, making it easier to interpret and understand. For monorepos, using parent-child pipelines in conjunction with `rules:changes` will only run pipelines on specified file changes (see the sketch at the end of this section). This reduces wasted time running pipelines across the entire repository.\n\n**5. Use incremental backups to eliminate downtime**\n\n[Incremental backups](https://docs.gitlab.com/ee/raketasks/backup_restore.html#incremental-repository-backups) can be faster than full backups because they only pack changes since the last backup into the backup bundle for each repository. This is super useful when you are working on a large repository and only developing on certain parts of the code base at a time.
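Here is the sketch promised in tip 4: a minimal parent configuration that triggers a child pipeline only when files under one service's directory change. The `services/billing/` path and job name are illustrative, not taken from our actual configuration.

```yml
# .gitlab-ci.yml (parent pipeline)
trigger-billing-pipeline:
  trigger:
    include: services/billing/.gitlab-ci.yml   # child pipeline definition
  rules:
    - changes:
        - services/billing/**/*                # run only when these files change
```

Each service keeps its own child configuration, so a change to one service no longer spins up jobs for all the others.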
## Where we are headed\n\nWhile these tips have helped many customers migrate from other version control systems to GitLab, we know there is still room for improvement. Over the next year, you will see us working on the following projects. We’d LOVE to hear from you, so share your thoughts, ideas, or simply 👍 on an issue to help prioritize things that will make your life easier.\n\n- [Git for enormous repositories](https://gitlab.com/groups/gitlab-org/-/epics/773)\n- [Expand SAST scanner support for monorepos](https://gitlab.com/groups/gitlab-org/-/epics/4895)\n- [Allow Reports to be Namespace to support monorepos](https://gitlab.com/gitlab-org/gitlab/-/issues/299490)\n",[707,823,9,781,732],{"slug":1847,"featured":6,"template":688},"tips-for-managing-monorepos-in-gitlab","content:en-us:blog:tips-for-managing-monorepos-in-gitlab.yml","Tips For Managing Monorepos In Gitlab","en-us/blog/tips-for-managing-monorepos-in-gitlab.yml","en-us/blog/tips-for-managing-monorepos-in-gitlab",{"_path":1853,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1854,"content":1859,"config":1865,"_id":1867,"_type":13,"title":1868,"_source":15,"_file":1869,"_stem":1870,"_extension":18},"/en-us/blog/tyranny-of-the-clock",{"title":1855,"description":1856,"ogTitle":1855,"ogDescription":1856,"noIndex":6,"ogImage":1199,"ogUrl":1857,"ogSiteName":672,"ogType":673,"canonicalUrls":1857,"schema":1858},"6 Lessons we learned when debugging a scaling problem on GitLab.com","Get a closer look at how we investigated errors originating from scheduled jobs, and how we stumbled upon \"the tyranny of the clock.\"","https://about.gitlab.com/blog/tyranny-of-the-clock","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"6 Lessons we learned when debugging a scaling problem on GitLab.com\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Craig Miskell\"}],\n        \"datePublished\": \"2019-08-27\",\n      }",{"title":1855,"description":1856,"authors":1860,"heroImage":1199,"date":1862,"body":1863,"category":681,"tags":1864},[1861],"Craig Miskell","2019-08-27","\nHere is a story of a scaling problem on GitLab.com: How we found it, wrestled with it, and ultimately resolved it. And how we discovered the tyranny of the clock.\n\n## The problem\n\nWe started receiving reports from customers that they were intermittently seeing errors on Git pulls from GitLab.com, typically from CI jobs or similar automated systems. The reported error message was usually:\n```\nssh_exchange_identification: connection closed by remote host\nfatal: Could not read from remote repository\n```\nTo make things more difficult, the error message was intermittent and apparently unpredictable. We weren't able to reproduce it on demand, nor identify any clear indication of what was happening in graphs or logs. The error message wasn't particularly helpful either; the SSH client was being told the connection had gone away, but that could be due to anything: a flaky client or VM, a firewall we don't control, an ISP doing something strange, or an application problem at our end. We deal with a *lot* of connections to Git-over-SSH, on the order of ~26 million a day, or 300/s average, so trying to pick out a small number of failing ones out of that firehose of data was going to be difficult. It's a good thing we like a challenge.\n\n## The first clue\n\nWe got in touch with one of our customers (thanks Hubert Hölzl from Atalanda) who was seeing the problem several times a day, which gave us a foothold. 
Hubert was able to supply the relevant public IP address, which meant we could run some packet captures on our frontend HAProxy nodes, to attempt to isolate the problem from a smaller data set than 'All of the SSH traffic.' Even better, they were using the [alternate-ssh port](/blog/gitlab-dot-com-now-supports-an-alternate-git-plus-ssh-port/), which meant we only had two HAProxy servers to look at, not 16.\n\nTrawling through these packet traces was still not fun; despite the constraints, there was ~500MB of packet capture from about 6.5 hours. We found the short-running connections, in which the TCP connection was established, the client sent a version string identifier, and then our HAProxy immediately tore down the connection with a proper TCP FIN sequence. This was the first great clue. It told us that it was definitely the GitLab.com end that was closing the connection, not something in between the client and us, meaning this was a problem we could debug.\n\n### Lesson #1: In Wireshark, the Statistics menu has a wealth of useful tools that I'd never really noticed until this endeavor.\n\nIn particular, 'Conversations' shows you a basic breakdown of time, packets, and bytes for each TCP connection in the capture, which you can sort. I *should* have used this at the start, instead of trawling through the captures manually. In hindsight, connections with small packet counts were what I was looking for, and the Conversations view shows this easily. I was then able to use this feature to find other instances, and verify that the first instance I found was not just an unusual outlier.\n\n## Diving into logs\n\nSo what was causing HAProxy to tear down the connection to the client? It certainly seemed unlikely that it was doing it arbitrarily, and there must be a deeper reason; another layer of [turtles](https://en.wikipedia.org/wiki/Turtles_all_the_way_down), if you will. The HAProxy logs seemed like the next place to check. Ours are stored/available in GCP BigQuery, which is handy because there's a lot of them, and we needed to slice 'n dice them in lots of different ways. But first, we were able to identify the log entry for one of the incidents from the packet capture, based on time and TCP ports, which was a major breakthrough. The most interesting detail in that entry was the `t_state` (Termination State) attribute, which was `SD`. From the HAProxy documentation:\n```\n    S: aborted by the server, or the server explicitly refused it\n    D: the session was in the DATA phase.\n```\n`D` is pretty clear; the TCP connection had been properly established, and data was being sent, which matched the packet capture evidence. The `S` means HAProxy received an RST, or an ICMP failure message from the backend. There was no immediate clue as to which case was occurring or possible causes. It could be anything from a networking issue (e.g. glitch or congestion) to an application-level problem. Using BigQuery to aggregate by the Git backends, it was clear it wasn't specific to any VM. We needed more information.\n\nSide note: It turned out that logs with `SD` weren't unique to the problem we were seeing. On the alternate-ssh port we get a lot of scanning for HTTPS, which leads to `SD` being logged when the SSH server sees a TLS ClientHello message while expecting an SSH greeting. 
This created a brief detour in our investigation.\n\nOn capturing some traffic between HAProxy and the Git server and using the Wireshark statistics tools again, it was quickly obvious that SSHD on the Git server was tearing down the connection with a TCP FIN-ACK immediately after the TCP three-way handshake; HAProxy still hadn't sent the first data packet but was about to, and when it did very shortly after, the Git server responded with a TCP RST. And thus we had the reason for HAProxy to log a connection failure with `SD`. SSH was closing the connection, apparently deliberately and cleanly, with the RST being just an artifact of the SSH server receiving a packet after the FIN-ACK, and doesn't mean anything else here.\n\n## An illuminating graph\n\nWhile watching and analyzing the `SD` logs in BigQuery, it became apparent that there was quite a bit of clustering going on in the time dimension, with spikes in the first 10 seconds after the top of each minute, peaking at about 5-6 seconds past:\n\n![Connection errors grouped by second](https://gitlab.com/gitlab-com/gl-infra/infrastructure/uploads/72cd1b763c51781fa4224495f059afb5/image.png){: .shadow.medium.center}\nConnection errors, grouped by second-of-the-minute\n{: .note.text-center}\n\nThis graph is created from data collated over a number of hours, so the fact that the pattern is so substantial suggests the cause is consistent across minutes and hours, and possibly even worse at specific times of the day. Even more interesting, the average spike is 3x the base load, which means we have a fun scaling problem and simply provisioning 'more resource' in terms of VMs to meet the peak loads would potentially be prohibitively expensive. This also suggested that we were hitting some hard limit, and was our first clue to an underlying systemic problem, which I have called \"the tyranny of the clock.\"\n\nCron, or similar scheduling systems, often don't have sub-minute accuracy, and if they do, it isn't used very often because humans prefer to think about things in round numbers. Consequently, jobs will run at the start of the minute or hour or at other nice round numbers. If they take a couple of seconds to do any preparations before they do a `git fetch` from GitLab.com, this would explain the connection pattern with increases a few seconds into the minute, and thus the increase in errors around those times.\n\n### Lesson #2: Apparently a lot of people have time synchronization (via NTP or otherwise) set up properly.\n\nIf they hadn't, this problem wouldn't have emerged so clearly. Yay for NTP!\n\nSo what could be causing SSH to drop the connection?\n\n## Getting close\n\nLooking through the documentation for SSHD, we found MaxStartups, which controls the maximum number of connections that can be in the pre-authenticated state. At the top of the minute, under the stampeding herd of scheduled jobs from around the internet, it seems plausible that we were exceeding the connections limit. MaxStartups actually has three numbers: the low watermark (the number at which it starts dropping connections), a percentage of connections to (randomly) drop for any connections above the low watermark, and an absolute maximum above which all new connections are dropped. The default is 10:30:100, and our setting at this time was 100:30:200, so clearly we had increased the connections in the past. 
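In `sshd_config` terms, the setting we were running at the time looks like this (a sketch of the directive only, not our actual configuration management):

```
# /etc/ssh/sshd_config (sketch)
# MaxStartups start:rate:full - begin randomly dropping 30% of new
# unauthenticated connections at 100, and refuse everything above 200
MaxStartups 100:30:200
```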
Perhaps it was time to increase it again.\n\nSomewhat annoyingly, the version of OpenSSH on our servers is 7.2, and the only way to see that MaxStartups is being breached in that version is to turn on Debug level logging. This is an absolute firehose of data, so we carefully turned it on for a short period on only one server. Thankfully, within a couple of minutes it was obvious that MaxStartups was being breached, and connections were being dropped early as a result.\n\nIt turns out that OpenSSH 7.6 (the version that comes with Ubuntu 18.04) has better logging about MaxStartups; it only requires Verbose logging to get it. While not ideal, it's better than Debug level.\n\n### Lesson #3: It is polite to log interesting information at default levels, and deliberately dropping a connection for any reason is definitely interesting to system administrators.\n\nSo now that we have a cause for the problem, how can we address it? We can bump MaxStartups, but what will that cost? Definitely a small bit of memory, but would it cause any untoward downstream effects? We could only speculate, so we had to just try it. We bumped the value to 150:30:300 (a 50% increase). This had a great positive effect, and no visible negative effect (such as increased CPU load):\n\n![Before and after graph](https://gitlab.com/gitlab-com/gl-infra/production/uploads/047a4859caafc6681c9d034c202418b9/image.png){: .shadow.medium.center}\n\nBefore and after bumping MaxStartups by 50%\n{: .note.text-center}\n\nNote the substantial reduction after 01:15. We've clearly eliminated a large proportion of the errors, although a non-trivial amount remained. Interestingly, these are clustered around round numbers: the top of the hour, every 30 minutes, 15 minutes, and 10 minutes. Clearly the tyranny of the clock continues. The top of the hour saw the biggest peaks, which seems reasonable in hindsight; a lot of people will simply schedule their jobs to run every hour at 0 minutes past the hour. This finding was more evidence confirming our theory that it was scheduled jobs causing the spikes, and that we were on the right path with this error being due to a numerical limit.\n\nDelightfully, there were no obvious negative effects. CPU usage on the SSH servers stayed about the same, with no noticeable increase in load, even though we were unleashing more connections that would previously have been dropped, and doing so at the busiest times. This was promising.\n\n## Rate limiting\n\nAt this point we weren't keen on simply bumping MaxStartups higher; while our 50% increase to-date had worked, it felt pretty crude to keep on pushing this arbitrarily higher. Surely there was something else we could do.\n\nMy search took me to the HAProxy layer that we have in front of the SSH servers. HAProxy has a nice 'rate-limit sessions' option for its frontend listeners. When configured, it constrains the new TCP connections per second that the frontend will pass through to backends, and leaves additional incoming connections on the TCP socket. If the incoming rate exceeds the limit (measured every millisecond), the new connections are simply delayed. The TCP client (SSH in this case) simply sees a delay before the TCP connection is established, which is delightfully graceful, in my opinion. As long as the overall rate never spiked too high above the limit for too long, we'd be fine.\n\nThe next question was what number we should use. 
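For the curious, the knob itself is a one-liner on each frontend. A minimal sketch (listener and backend names are illustrative; the arithmetic behind the value is walked through next):

```
# haproxy.cfg (sketch)
frontend ssh_git
    mode tcp
    bind :22
    rate-limit sessions 110    # new TCP connections per second, per frontend
    default_backend ssh_servers
```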
This is complicated by the fact that we have 27 SSH backends, and 18 HAProxy frontends (16 main, two alt-ssh), and the frontends don't coordinate amongst themselves for this rate limiting. We also had to take into account how long it takes a new SSH session to make it past authentication: Assuming MaxStartups of 150, if the auth phase took two seconds we could only send 75 new sessions per second to each backend. The [note on the issue](https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7168#note_191678023) has the derivation of the math, and I won't recount it in detail here, except to note that there are four quantities needed to calculate the rate-limit: the counts of both server types, the value of MaxStartups, and `T`, which is how long the SSH session takes to auth. `T` is critical, but we could only estimate it. You might speculate how well I did at this estimate, but that would spoil the story. I went with two seconds for now, and came to a rate limit per frontend of approximately 112.5, which we rounded down to 110.\n\nWe deployed. Everything was happy, yes? Errors tended to zero, and children danced happily in the streets? Well, not so much. This change had no visible effect on the error rates. I will be honest here, and say I was rather distressed. We had missed something important, or misunderstood the problem space entirely.\n\nSo we went back to logs (and eventually the HAProxy metrics), and were able to verify that the rate limiting was at least constraining connections to the number we specified, and that historically this number had been higher, so we were successfully limiting the rate at which connections were being dispatched. But clearly the rate was still too high, and not only that, it wasn't even *close* enough to the right number to have a measurable impact. Looking at the selection of backends (as logged by HAProxy) showed an oddity: At the top of the hour, the backend connections were not evenly distributed across all the SSH servers. In the sample time chosen, it varied from 30 to 121 in a given second, meaning our load balancing wasn't very balanced. Reviewing the configuration showed we were using `balance source`, so that a given client IP address would always connect to the same backend. This might be good if you needed session stickiness, but this is SSH and we have no such need. It was deliberately chosen some time ago, but there was no record as to why. We couldn't come up with a good reason to keep it, so we tried changing to leastconn, which distributes new incoming connections to the backend with the least number of current connections. This was the result for CPU usage on our SSH (Git) fleet:\n\n![Leastconn before and after](https://gitlab.com/gitlab-com/gl-infra/infrastructure/uploads/b006877c1e45ad0255a316a96750402c/before-after-leastconn-change.png){: .shadow.medium.center}\n\nBefore and after turning on leastconn\n{: .note.text-center}\n\nClearly leastconn was a good idea. The two low-usage lines are our [Canary](/handbook/engineering/infrastructure/library/canary/) servers and can be ignored, but the spread on the others before the change was 2:1 (30% to 60%), so clearly some of our backends were much busier than others due to the source IP hashing. 
This was surprising to me; it seemed reasonable to expect the range of client IPs to be sufficient to spread the load much more evenly, but apparently a few large outliers were enough to skew the usage significantly.\n\n### Lesson #4: When you choose specific non-default settings, leave a comment or link to documentation/issues as to why; future people will thank you.\n\nThis transparency is [one of GitLab's core values](https://handbook.gitlab.com/handbook/values/#say-why-not-just-what).\n\nTurning on leastconn also helped reduce the error rates, so it is something we wanted to continue with. In the spirit of experimenting, we dropped the rate limit to 100, which further reduced the error rate, suggesting that perhaps the initial estimate for `T` was wrong. But if so, it was too small, leading to the rate limit being too high, and even 100/s felt pretty low and we weren't keen to drop it further. Unfortunately, for some operational reasons, these two changes were just an experiment, and we had to roll back to `balance source` and a rate limit of 100.\n\nWith the rate limit as low as we were comfortable with, and leastconn insufficient, we tried increasing MaxStartups: first to 200 with some effect, then to 250. Lo, the errors all but disappeared, and nothing bad happened.\n\n### Lesson #5: As scary as it looks, MaxStartups appears to have very little performance impact even if it's raised much higher than the default.\n\nThis is probably a large and powerful lever we can pull in future, if necessary. It's possible we might notice problems if it gets into the thousands or tens of thousands, but we're a long way from that.\n\nWhat does this say about my estimate for `T`, the time to establish and authenticate an SSH session? Reverse engineering the equation, knowing that 200 wasn't quite enough for MaxStartups, and that 250 was enough, we could calculate that `T` is probably between 2.7 and 3.4 seconds. So the estimate of two seconds wasn't far off, but the actual value was definitely higher than expected. We'll come back to this a bit later.\n\n## Final steps\n\nLooking at the logs again in hindsight, and after some contemplation, we discovered that we could identify this specific failure with t_state being `SD` and b_read (bytes read by client) of 0. As noted above, we handle approximately 26-28 million SSH connections per day. It was unpleasant to discover that at the worst of the problem, roughly 1.5% of those connections were being dropped badly. Clearly the problem was bigger than we had realized at the start. There was nothing about this that we couldn't have identified earlier (right back when we discovered that t_state=\"SD\" was indicative of the issue), but we didn't think to do so, and we should have. It might have increased how much effort we put in.\n\n### Lesson #6: Measure the actual rate of your errors as early as possible.\n\nWe might have put a higher priority on this earlier had we realized the extent of the problem, although it was still dependent on knowing the identifying characteristic.\n\nOn the plus side, after our bumps to MaxStartups and rate limiting, the error rate was down to 0.001%, or a few thousand per day. This was better, but still higher than we liked. After we unblocked some other operational matters, we were able to formally deploy the leastconn change, and the errors were eliminated entirely. We could breathe easy again.\n\n## Further work\n\nClearly the SSH authentication phase is still taking quite a while, perhaps up to 3.4 seconds. 
GitLab can use [AuthorizedKeysCommand](https://docs.gitlab.com/ee/administration/operations/fast_ssh_key_lookup.html) to look up the SSH key directly in the database. This is critical for speedy operations when you have a large number of users; otherwise, SSHD has to sequentially read a very large `authorized_keys` file to look up the public key of the user, and this doesn't scale well. We implement the lookup with a little bit of Ruby that calls an internal HTTP API. [Stan Hu](/company/team/#stanhu), engineering fellow and our resident source of GitLab knowledge, identified that the unicorn instances on the Git/SSH servers were experiencing substantial queuing. This could be a significant contributor to the ~3-second pre-authentication stage, and therefore something we need to look at further, so investigations continue. We may increase the number of unicorn (or puma) workers on these nodes, so there's always a worker available for SSH. However, that isn't without risk, so we will need to be careful and measure well. Work continues, but slower now that the core user problem has been mitigated. We may eventually be able to reduce MaxStartups, although given the lack of negative impact it seems to have, there's little need. It would make everyone more comfortable if OpenSSH let us see how close we were to hitting MaxStartups at any point, rather than having to go in blind and only find out we were close when the limit is breached and connections are dropped.\n\nWe also need to alert when we see HAProxy logs that indicate the problem is occurring, because in practice there's no reason it should ever happen. If it does, we need to increase MaxStartups further, or if resources are constrained, add more Git/SSH nodes.\n\n## Conclusion\n\nComplex systems have complex interactions, and there is often more than one lever that can be used to control various bottlenecks. It's good to know what tools are available because they often have trade-offs. Assumptions and estimates can also be risky. In hindsight, I would have attempted to get a much better measurement of how long authentication takes, so that my `T` estimate was better.\n\nBut the biggest lesson is that when large numbers of people schedule jobs at round numbers on the clock, it leads to really interesting scaling problems for centralized service providers like GitLab. If you're one of them, you might like to consider putting in a random sleep of maybe 30 seconds at the start, or picking a random time during the hour *and* putting in the random sleep, just to be polite and fight the tyranny of the clock.
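If your jobs run from cron, a sketch of what that politeness can look like (the 30-second bound is arbitrary, and `bash` is assumed so that `$RANDOM` exists):

```
SHELL=/bin/bash
# Run hourly, but sleep a random 0-29 seconds first so thousands of
# clients don't all hit the server at exactly the top of the hour.
0 * * * * sleep $((RANDOM % 30)) && git fetch --all --quiet
```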
Cover image by [Jon Tyson](https://unsplash.com/@jontyson) on [Unsplash](https://unsplash.com)\n{: .note}\n",[757,9,864],{"slug":1866,"featured":6,"template":688},"tyranny-of-the-clock","content:en-us:blog:tyranny-of-the-clock.yml","Tyranny Of The Clock","en-us/blog/tyranny-of-the-clock.yml","en-us/blog/tyranny-of-the-clock",{"_path":1872,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1873,"content":1879,"config":1886,"_id":1888,"_type":13,"title":1889,"_source":15,"_file":1890,"_stem":1891,"_extension":18},"/en-us/blog/understanding-and-improving-total-blocking-time",{"title":1874,"description":1875,"ogTitle":1874,"ogDescription":1875,"noIndex":6,"ogImage":1876,"ogUrl":1877,"ogSiteName":672,"ogType":673,"canonicalUrls":1877,"schema":1878},"Total Blocking Time - The metric to know for faster website performance","Learn how to identify and fix some root causes for high Total Blocking Time.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749682637/Blog/Hero%20Images/tbt_cover_image.jpg","https://about.gitlab.com/blog/understanding-and-improving-total-blocking-time","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Total Blocking Time - The metric to know for faster website performance\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Jacques Erasmus\"}],\n        \"datePublished\": \"2023-02-14\",\n      }",{"title":1874,"description":1875,"authors":1880,"heroImage":1876,"date":1882,"body":1883,"category":681,"tags":1884},[1881],"Jacques Erasmus","2023-02-14","\n\nOur world overwhelms us with information that is more accessible than ever. The increasing rates of content production and consumption are gifts that keep on giving. We can't seem to keep up with the information thrown at us. We're limited by our cognitive capacity and time constraints, and a [recent study](https://www.nature.com/articles/s41467-019-09311-w) concluded the result is a shortening of attention spans. Websites are no exception.\n\nUsers who interact with your website want feedback, and want it fast. Preferably immediately! Website performance has become an important factor in keeping users engaged. But how do you measure how unresponsive a page is before it becomes fully interactive?\n\nMany [performance metrics](https://web.dev/vitals/) exist, but this blog post focuses on Total Blocking Time (TBT).\n\n## What is Total Blocking Time?\n\nTBT measures the total amount of time tasks were blocking your browser's main thread. This metric represents the total amount of time that a user could not interact with your website. It's measured between [First Contentful Paint (FCP)](https://web.dev/fcp/) and [Time to Interactive (TTI)](https://web.dev/tti/), and represents the combined blocking time for all long tasks.\n\n## What is a long task?\n\nA long task is a process that runs on the main thread for longer than 50 milliseconds (ms). After a task starts, a browser can't interrupt it, and a single long-running task can block the main thread. The result: a website that is unresponsive to user input until the task completes.\n\nAfter the first 50 ms, all time spent on a task is counted as _blocking time_. 
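You can watch these long tasks arrive in your own console via the Long Tasks API. A rough sketch (real TBT only counts tasks between FCP and TTI, which this simple observer ignores):

```js
// Accumulate an approximation of blocking time from long task entries.
let blockingTime = 0;
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    blockingTime += entry.duration - 50; // only time past the 50 ms budget blocks
  }
  console.log(`~${Math.round(blockingTime)} ms of blocking time so far`);
});
observer.observe({ type: 'longtask', buffered: true });
```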
This diagram shows five tasks, two of which block the main thread for 140 ms:\n\n![A diagram containing five tasks, two of which are blocking the main thread. The TBT for these tasks adds up to 140 ms.](https://about.gitlab.com/images/blogimages/tbt/tasks_diagram.png)\n\n## How can we measure TBT?\n\nMany tools measure TBT, but here we’ll use [Chrome DevTools](https://developer.chrome.com/docs/devtools/evaluate-performance/) to analyze runtime performance.\n\nAs an example: We recently improved performance on GitLab's [**View Source** page](https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab-ci.yml). This screenshot, taken before the performance improvement, shows eight long-running tasks containing a TBT of **2388.16 ms**. That's more than **two seconds**:\n\n![A screenshot indicating that there are eight long-running tasks. The TBT of these tasks adds up to 2388.16 ms.](https://about.gitlab.com/images/blogimages/tbt/summary_before.png)\n\n## How can we improve TBT?\n\nAs you might have guessed by now, reducing the time needed to complete long-running tasks reduces TBT.\n\nBy selecting one of the tasks from the previous screenshot, we can get a breakdown of how the browser executed it. This **Bottom-Up** view shows that much time is spent on rendering content in the Document Object Model (DOM):\n\n![A screenshot of the Bottom-Up view of one of tasks from the previous screenshot. It indicates that most of the time is being spent on rendering content in the DOM.](https://about.gitlab.com/images/blogimages/tbt/task_7_before.png)\n\nThis page has a lot of content that is below the fold – not immediately visible. The browser is spending a lot of resources upfront to render content that is not even visible to the user yet!\n\nSo what can we do? Some ideas:\n\n- **Change the UX.**\n  - Add a Show More button, paging, or virtual scrolling for long lists.\n- **Lazy-load images.**\n  ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65745))\n    - Lazy-loading images reduces page weight, allowing the browser to spend resources on more important tasks.\n- **Lazy-load long lists.**\n  ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71633))\n    - Similar to lazy-loading images, this approach allows the browser to spend resources on more important tasks.\n- **Reduce excessive HTML.**\n  ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/65835))\n    - For example, when loading large pages consider removing unnecessary content. Or, consider rendering some content (like icons) with CSS instead.\n- **Defer rendering when possible.**\n    - The [`content-visibility: auto;`](https://developer.mozilla.org/en-US/docs/Web/CSS/content-visibility) CSS property ensures the rendering of off-screen elements (and thus irrelevant to the user) is skipped without affecting the page layout. ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67050))\n    - The [Intersection Observer API](https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API) allows you to observe when elements intersect with the viewport. This information can be used to show or hide certain elements. 
([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/71633))\n    - The global [`requestIdleCallback` method](https://developer.mozilla.org/en-US/docs/Web/API/Window/requestIdleCallback?qs=requestIdleCallback) can be used to render content after the browser goes into an idle state.\n  ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/101942/diffs#7eed73783787184e5b1c029b9668e48638f3a6e8_64_78))\n\nFrameworks such as VueJS and React are already heavily optimized. However, be mindful of how you use these frameworks to avoid expensive tasks.\n\n### Change VueJS usage to improve TBT\n\nThis screenshot shows the **Bottom-Up** view of a task. Much of the task time is spent on activities from third-party code in the VueJS framework:\n\n![A screenshot of the Bottom-Up view of one of the tasks. It indicates that a lot of the time is being spent on activities in the third-party VueJS framework.](https://about.gitlab.com/images/blogimages/tbt/task_6_before.png)\n\nWhat improvements can we make?\n\n- **Use [Server-side rendering (SSR)](https://gitlab.com/gitlab-org/gitlab/-/issues/215365) or [streaming](https://gitlab.com/gitlab-org/frontend/rfcs/-/issues/101)** for pages that are sensitive to page load performance.\n- **If you don't _need_ Vue, don't use it.**\n  Component instances are a lot more expensive than using plain DOM nodes. Try to avoid unnecessary component abstractions.\n- **Optimize component [props](https://vuejs.org/guide/components/props.html).**\n  Child components in Vue update when at least one of their received props is updated. Analyze the data that you pass to components. You may find that you can avoid unnecessary updates by making changes to your props strategy.\n- **Use [v-memo](https://vuejs.org/api/built-in-directives.html#v-memo) to skip updates.**\n    - In Vue versions 3.2 and later, `v-memo` enables you to cache parts of your template. The cached template updates and re-renders only if one of its provided dependencies changes.\n- **Use [v-once](https://vuejs.org/api/built-in-directives.html#v-once) for data** that does not need to be reactive after the initial load.\n  ([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/101942))\n    - `v-once` ensures the element and component are only rendered once. Any future updates will be skipped.\n- **Reduce expensive tasks in your Vue components.**\n  Even a small script may take a long time to finish if it’s not optimized enough. Some suggestions:\n    - By using [`requestIdleCallback`](https://developer.mozilla.org/en-US/docs/Web/API/Window/requestIdleCallback?qs=requestIdleCallback) you can defer the execution of non-critical tasks. 
([example](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/101942/diffs#7eed73783787184e5b1c029b9668e48638f3a6e8_64_78))\n    - By executing expensive scripts in [WebWorkers](https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers) you can unblock the main thread.\n\n### Results and methods\n\nBy using three of the methods suggested above, we reduced TBT from about **3 seconds** to approximately **500 ms**:\n\n![A chart indicating a drop in TBT from ~3 seconds to ~500 milliseconds.](https://about.gitlab.com/images/blogimages/tbt/chart_after.png)\n\nWhat did we do?\n\n- Deferred rendering by using the [`content-visibility: auto;`](https://developer.mozilla.org/en-US/docs/Web/CSS/content-visibility) CSS property.\n- Deferred rendering by using the [Intersection Observer API](https://developer.mozilla.org/en-US/docs/Web/API/Intersection_Observer_API).\n- Used [v-once](https://vuejs.org/api/built-in-directives.html#v-once) for content that didn't need to be reactive after rendering.\n\nRemember, the size of the decrease always depends on how optimized your app is to begin with.\n\nThere is a lot more we can do to improve TBT. While the specific approach depends on the app you're optimizing, the general methods discussed here are very effective at finding improvement opportunities in any app. Like most things in life, a series of small changes often yields the biggest impact. So let's [iterate](/blog/dont-confuse-these-twelve-shortcuts-with-iteration/) together, and adapt to this ever-changing world.\n\n> “Adaptability is the simple secret of survival.” – Jessica Hagedorn\n\n_Cover image by [Growtika](https://unsplash.com/@growtika?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/photos/Iqi0Rm6gBkQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)_\n",[1885,9,755],"frontend",{"slug":1887,"featured":6,"template":688},"understanding-and-improving-total-blocking-time","content:en-us:blog:understanding-and-improving-total-blocking-time.yml","Understanding And Improving Total Blocking Time","en-us/blog/understanding-and-improving-total-blocking-time.yml","en-us/blog/understanding-and-improving-total-blocking-time",
for our clusters of Postgres database nodes. These nodes have been running on Ubuntu 16.04 with extended security maintenance patches and it is now time to get them to a more current version. Usually, this kind of upgrade is a behind-the-scenes event, but there is an underlying technicality that will require us to take a maintenance window to do the upgrade (more on that [below](#the-challenge)).\n\nWe have been preparing for and [practicing this upgrade](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7577) and are now ready to schedule the window to do this work for GitLab.com.\n\n## When will the OS upgrade take place and what does this mean for users of GitLab.com?\n\nThis change is planned to take place on 2022-09-03 (Saturday) between 11:00 UTC and 14:00 UTC. The implementation of this change is anticipated to include a **service downtime of up to 180 minutes** (see [reference issue](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7543)). During this time you will experience complete service disruption of GitLab.com.\n\nWe are taking downtime to ensure that the application works as expected following the OS upgrade and to minimize the risk of any data integrity issues.\n\n> Join us at [GitLab Commit 2022](/events/commit/) and connect with the ideas, technologies, and people that are driving DevOps and digital transformation.\n\n## Background\n\nGitLab.com's [database architecture](/handbook/engineering/infrastructure/production/architecture/#database-architecture) uses two Patroni/Postgres database clusters: main and CI. We recently did functional decomposition and now the CI Cluster stores the data generated by CI GitLab features. Each Patroni cluster has primary and multiple read-only replicas. For each of the Patroni clusters, the Postgres database size is ~18 TB running on Ubuntu 16.04. During the scheduled change window, we will be switching over to our newly built Ubuntu 20.04 clusters.\n\n## The challenge\n\nUbuntu 18.10 introduced an updated version of glibc (2.28), which includes a [major update to locale data](https://wiki.postgresql.org/wiki/Locale_data_changes) and causes Postgres indexes created with earlier versions of glibc to be corrupted. Because we are upgrading to Ubuntu 20.04, our indexes are affected by this. Therefore, during the downtime window scheduled for this work, we need to detect potentially corrupt indexes and have them reindexed before we enable production traffic again. We currently have the following types and the approximate number of indexes:\n\n```\n Index Type | # of Indexes\n------------+--------------\n btree      |         4079\n gin        |          101\n gist       |            3\n hash       |            1\n```\n\nAs you can appreciate, given the sheer number (and size) of these indexes, it would take far too long to reindex every single index during the scheduled downtime window, so we need to streamline the process.\n\n## Options to upgrade to Ubuntu 20.04 safely\n\nThere are a number of ways to deal with the problem of potentially corrupt indexes:\n\na. Reindex **all** indexes during the scheduled downtime window\n\nb. Transport data to target 20.04 clusters in a logical (not binary) way, including:\n\n  - Backups/upgrades using pg_dump\n  - Logical replication\n\nc. 
Use streaming replication from 16.04 to 20.04 and during the downtime window, break replication and promote the 20.04 clusters followed by reindexing of potentially corrupt indexes\n\nIt might be feasible for a small to a medium-size Postgres implementation to use options a or b; however, at the GitLab.com scale, it would require a much larger downtime window and our aim is to reduce the impact to our customers as much as possible.\n\n## High-level approach for the OS upgrade\n\nTo perform an OS upgrade on our Patroni clusters, we use Postgres streaming replication to replicate data from our current Ubuntu 16.04 clusters to the brand new Ubuntu 20.04 standby Patroni clusters. During the scheduled downtime window, we will stop all traffic to the current 16.04 clusters, promote the 20.04 clusters by making them Primary and demote the Ubuntu 16.04 clusters by reconfiguring to act as Standby while replicating from the new 20.04 primaries. We will then reindex all the identified potentially corrupt indexes, and update DNS to point the application to the new 20.04 Patroni clusters before opening traffic to the public.\n\n## Identifying potentially corrupt indexes and our approach to handling the reindexing for different types of indexes\n\n### B-Tree\n\nWe use `bt_index_parent_check` [amcheck function](https://www.postgresql.org/docs/12/amcheck.html) to identify potentially corrupt indexes and we will reindex them during the downtime window.\n\n### GiST and Hash\n\nSince we do not have many GiST and Hash indexes, and reindexing them is a relatively quick operation, we will reindex them all during the downtime window.\n\n### GIN\n\nCurrently, the production version of amcheck is limited to detecting potential corruption in B-Tree indexes only. Our GIN indexes are reasonably sized and it would require a significant amount of time to reindex them during the scheduled downtime window, which is not feasible as we cannot have the site unavailable to our customers for that long. We have collaborated closely with our database team to produce a list of business-critical GIN indexes to be reindexed **during** the downtime window, and any other GIN indexes will be reindexed immediately after we open up traffic to the public using the [CONCURRENTLY](https://www.postgresql.org/docs/current/sql-reindex.html#SQL-REINDEX-CONCURRENTLY) option. Using this option means it will take longer to reindex, but it allows normal operations to continue while the indexes are being rebuilt.\n\n## Performance improvements\n\nWe started looking into options to improve the performance of the reindexing (see [reference issue](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15559#note_940517257)). There are a couple of areas where we needed to improve performance.\n\n### Identify potentially corrupt B-Tree indexes quickly\n\nWhen we first started using the amcheck to identify potentially corrupt indexes, it was single threaded so it was taking just under five days to run the amcheck script to identify potentially corrupt indexes on production data. After a few iterations, our amcheck script now runs a separate background worker process for each index, so we essentially get a performance improvement of about 96 times when we use a 96 CPU core VM to run amcheck. The performance is limited by the time it takes to run amcheck on the largest index. 
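For flavor, the per-index check and the post-cutover rebuild are both single statements in SQL. A sketch (the index name is illustrative, and the `amcheck` extension must be installed):

```sql
-- Verify a B-Tree index, including that every heap tuple is indexed
SELECT bt_index_parent_check('index_issues_on_title'::regclass, true);

-- Rebuild a suspect index without blocking writes (how the lower-priority
-- GIN indexes are handled after traffic is re-enabled)
REINDEX INDEX CONCURRENTLY index_issues_on_title;
```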
The script is customizable to skip or include a specific set of tables/indexes, and we can decide the number of parallel worker processes to use based on the number of CPU cores available on the VM we use to run amcheck. Now with the improved speed, we can run the amcheck script on a copy of production data a day or two before the scheduled OS upgrade downtime window.\n\n### Improve reindexing speed to reduce the downtime\n\nOur initial test to reindex was performed sequentially with the default Postgres parameters. We have tested reindexing with different Postgres parameters and parallelized the reindex process. We are now able to perform our reindexing in less than half the time it used to take.\n\n## Reading material\n\nFor more information, please see the following links:\n\n- [Ubuntu 20.04 Upgrade Epic](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/637)\n- [Research on the types of indexes and steps to identify corruption](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15384#note_867281334)\n",[948,9,707],{"slug":1905,"featured":6,"template":688},"upgrading-database-os","content:en-us:blog:upgrading-database-os.yml","Upgrading Database Os","en-us/blog/upgrading-database-os.yml","en-us/blog/upgrading-database-os",{"_path":1911,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1912,"content":1918,"config":1924,"_id":1926,"_type":13,"title":1927,"_source":15,"_file":1928,"_stem":1929,"_extension":18},"/en-us/blog/using-run-parallel-jobs",{"title":1913,"description":1914,"ogTitle":1913,"ogDescription":1914,"noIndex":6,"ogImage":1915,"ogUrl":1916,"ogSiteName":672,"ogType":673,"canonicalUrls":1916,"schema":1917},"How we used parallel CI/CD jobs to increase our productivity","GitLab uses parallel jobs to help long-running jobs run faster.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749666717/Blog/Hero%20Images/cover-image.jpg","https://about.gitlab.com/blog/using-run-parallel-jobs","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"How we used parallel CI/CD jobs to increase our productivity\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Miguel Rincon\"}],\n        \"datePublished\": \"2021-01-20\",\n      }",{"title":1913,"description":1914,"authors":1919,"heroImage":1915,"date":1921,"body":1922,"category":681,"tags":1923},[1920],"Miguel Rincon","2021-01-20","\n\nAt GitLab, we must verify simultaneous changes from the hundreds of people who contribute to GitLab each day. How can we help them contribute efficiently using our pipelines?\n\nThe pipelines that we use to build and verify GitLab have more than 90 jobs. Not all of those jobs are equal. Some are simple tasks that take a few seconds to finish, while others are long-running processes that must be optimized carefully.\n\nAt the time of this writing, we have more than 700 [pipelines running](https://gitlab.com/gitlab-org/gitlab/-/pipelines?page=1&scope=all&status=running). Each of these pipelines represents changes from team members and contributors from the wider community. All GitLab contributors must wait for the pipelines to finish to make sure the change works and integrates with the rest of the product. We want our pipelines to finish as fast as possible to maintain the productivity of our teams.\n\nThis is why we constantly monitor the duration of our pipelines. 
For example, in December 2020, successful merge request pipelines had a duration of [53.8 minutes](/handbook/engineering/quality/performance-indicators/#average-merge-request-pipeline-duration-for-gitlab):\n\n![Average pipeline duration was 53.8 minutes in December](https://about.gitlab.com/images/blogimages/using-run-parallel-jobs/historical-pipeline-duration.png){: .shadow.medium.center}\nThe average pipeline took 53.8 minutes to finish in December.\n{: .note.text-center}\n\nGiven that we run [around 500 merge request pipelines](https://gitlab.com/gitlab-org/gitlab/-/pipelines/charts) per day, we want to know: Can we optimize our process to change how long-running jobs _run_?\n\n## How we fixed our bottleneck jobs by making them run in parallel\n\nThe `frontend-fixtures` job uses `rspec` to generate mock data files, which are then saved as files called \"fixtures\". These files are loaded by our frontend tests, so the `frontend-fixtures` must finish before any of our frontend tests can start.\n\n> As not all of our tests need these frontend fixtures, many jobs use the [`needs` keyword](https://docs.gitlab.com/ee/ci/yaml/#needs) to start before the `frontend-fixtures` job is done.\n\nIn our pipelines, this job looked like this:\n\n![The `frontend-fixtures` job](https://about.gitlab.com/images/blogimages/using-run-parallel-jobs/fixtures-job.png){: .shadow.medium.center}\nInside the frontend fixtures job.\n{: .note.text-center}\n\n\nThis job had a normal duration of 20 minutes, and each individual fixture could be generated independently, so we knew there was an opportunity to run this process in parallel.\n\nThe next step was to configure our pipeline to split the job into multiple batches that could be run in parallel.\n\n## How to make frontend-fixtures a parallel job\n\nFortunately, GitLab CI provides an easy way to run a job in parallel using the [`parallel` keyword](https://docs.gitlab.com/ee/ci/yaml/#parallel). In the background, this creates \"clones\" of the same job, so that multiple copies of it can run simultaneously.\n\n**Before:**\n\n```yml\nfrontend-fixtures:\n  extends:\n    - .frontend-fixtures-base\n    - .frontend:rules:default-frontend-jobs\n```\n\n**After:**\n\n```yml\nrspec-ee frontend_fixture:\n  extends:\n    - .frontend-fixtures-base\n    - .frontend:rules:default-frontend-jobs\n  parallel: 2\n```\n\nYou will notice two changes. 
First, we changed the name of the job so it is picked up by [Knapsack](https://docs.knapsackpro.com/ruby/knapsack) (more on that later), and then we added the keyword `parallel`, so the job gets duplicated and runs in parallel.\n\nThe new jobs that are generated look like this:\n\n![Our fixtures job running in parallel](https://about.gitlab.com/images/blogimages/using-run-parallel-jobs/fixtures-job-parallel.png){: .shadow.medium.center}\nThe new jobs that are picked up by Knapsack and run in parallel.\n{: .note.text-center}\n\nBecause we used a value of `parallel: 2`, two jobs are generated, with the names:\n\n- `rspec-ee frontend_fixture 1/2`\n- `rspec-ee frontend_fixture 2/2`\n\nOur two \"generated\" jobs now take 3 and 17 minutes respectively, giving us an overall decrease of about three minutes.\n\n![Two parallel jobs in the pipeline](https://about.gitlab.com/images/blogimages/using-run-parallel-jobs/fixtures-job-detail.png){: .shadow.medium.center}\nThe parallel jobs that are running in the pipeline.\n{: .note.text-center}\n\n## Another way we optimized the process\n\nAs we use Knapsack to distribute the test files among the parallel jobs, we were able to make more improvements by reducing the time it takes our longest-running fixtures-generator file to run.\n\nWe did this by splitting the file into smaller batches and optimizing it, so we have more tests running in parallel, which shaved off an additional [~3.5 minutes](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/47158#note_460372560).\n\n## Tips for running parallel jobs\n\nIf you want to ramp up your productivity, you can leverage `parallel` on your pipelines by following these tips:\n\n1. Measure the time your pipelines take to run and identify possible bottlenecks among your jobs. You can do this by checking which jobs are slower than others.\n1. Once your slow jobs are identified, try to figure out if they can be run independently from each other or in batches.\n   - Automated tests are usually good candidates, as they tend to be self-contained and run in parallel anyway.\n1. Add the `parallel` keyword, while measuring the outcome over the next few pipeline runs, as in the sketch below.
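\n\nHere is a minimal sketch of the whole pattern, using hypothetical job names and commands rather than our actual configuration: a slow job is split into three copies with `parallel`, and a downstream job uses `needs` to start as soon as all of the copies have finished.\n\n```yml\n# Hypothetical example -- job names and scripts are illustrative\nstages: [fixtures, test]\n\nfixtures:\n  stage: fixtures\n  script: bundle exec rake frontend:fixtures\n  parallel: 3 # runs as \"fixtures 1/3\", \"fixtures 2/3\", \"fixtures 3/3\"\n\njest:\n  stage: test\n  needs: [\"fixtures\"] # starts once all three copies finish, without waiting for the rest of the stage\n  script: yarn jest\n```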
\n\n## Learn more about our solution\n\nWe discuss how running jobs in parallel improved the speed of pipelines on GitLab Unfiltered.\n\n\u003C!-- blank line -->\n\u003Cfigure class=\"video_container\">\n  \u003Ciframe src=\"https://www.youtube-nocookie.com/embed/hKsVH_ZhSAk\" frameborder=\"0\" allowfullscreen=\"true\"> \u003C/iframe>\n\u003C/figure>\n\u003C!-- blank line -->\n\nAnd here are links to some of the resources we used to run pipelines in parallel:\n\n- The [merge request that introduced `parallel` to fixtures](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/46959).\n- An important [optimization follow-up](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/47158) to make one of the slow tests faster.\n- The [Knapsack gem](https://docs.knapsackpro.com/ruby/knapsack), which we leverage to split the tests more evenly in multiple CI nodes.\n\nAnd many thanks to [Rémy Coutable](/company/team/#rymai), who helped me implement this improvement.\n\nCover image by [@dustt](https://unsplash.com/@dustt) on [Unsplash](https://unsplash.com/photos/ZqBNb7xK5s8)\n{: .note}\n",[754,683,684,9,732],{"slug":1925,"featured":6,"template":688},"using-run-parallel-jobs","content:en-us:blog:using-run-parallel-jobs.yml","Using Run Parallel Jobs","en-us/blog/using-run-parallel-jobs.yml","en-us/blog/using-run-parallel-jobs",{"_path":1931,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1932,"content":1937,"config":1943,"_id":1945,"_type":13,"title":1946,"_source":15,"_file":1947,"_stem":1948,"_extension":18},"/en-us/blog/value-stream-total-time-chart",{"title":1933,"description":1934,"ogTitle":1933,"ogDescription":1934,"noIndex":6,"ogImage":1199,"ogUrl":1935,"ogSiteName":672,"ogType":673,"canonicalUrls":1935,"schema":1936},"Value stream optimization with GitLab's Total Time Chart","Learn how this new analytics feature provides immediate insights about the time spent in each stage of your workstream.","https://about.gitlab.com/blog/value-stream-total-time-chart","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Value stream management: Total Time Chart simplifies top-down optimization flow\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Haim Snir\"}],\n        \"datePublished\": \"2023-06-01\",\n      }",{"title":1938,"description":1934,"authors":1939,"heroImage":1199,"date":1940,"body":1941,"category":730,"tags":1942},"Value stream management: Total Time Chart simplifies top-down optimization flow",[818],"2023-06-01","\n\nUnderstanding where time is spent during the development lifecycle is a crucial insight for software leaders when optimizing the value delivery to customers. Our new Value Stream Analytics Total Time Chart is a visualization that helps managers uncover how long it actually takes to complete the development process from idea to production. Managers can also learn how much time teams spend in each stage of the workflow.\n \n![The VSA Total Time Chart displays the average time to complete each value stream stage.](https://about.gitlab.com/images/blogimages/2023-05-07-vsa-overview.gif){: .shadow}\nValue Stream Analytics Total Time Chart\n{: .note.text-center}\n\nValue Stream Analytics is available out of the box in the GitLab platform. It surfaces the process and value delivery metrics through the unified data model that stores all the records around development efforts. 
Value Stream Analytics uses a backend process to collect and aggregate stage-level data into [three core objects](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#how-value-stream-analytics-works):\n\n- Value streams - container objects with a list of stages\n- Value stream stage - a pair of start and end events\n- Value stream stage events - the smallest building blocks of the value stream. For example, from Issue created to Issue first added to board. See the [list of available stage events](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#value-stream-stage-events).\n\n> [Register for the GitLab 16 webinar](/sixteen/), where we will unveil the latest innovations in our AI-powered DevSecOps platform.\n\nIn the new chart, we added the stage breakdown as a stacked area chart to make it easier to understand how each stage contributes to the total time, and how that changes over time. Each area in the chart represents a stage. By comparing the heights of each area, you can get an idea of how each stage contributes to the total time of the value stream. We also added a tooltip with the stage breakdown sorted top to bottom, to help you understand the stages in their correct order.\n\nThe new chart is available in the Value Stream Analytics Overview page (on the left sidebar, select **Analytics > Value stream**). This page includes four sections:\n  1. Data filter text box - at the top of the Overview page, you can use the [Data filters](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#data-filters) to view data that matches specific criteria or a date range.\n  2. Stage navigation bar - below the filter text box, you can use the stage navigation bar to investigate what happened in a specific stage and to identify the items (issues/MRs) that are slowing down the stage time.\n  3. Key metrics tiles - a summary of the stream's performance, displayed above the chart in the [Key metrics tiles](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#key-metrics).\n  4. Overview charts - the newly added Total Time Chart and the [Task by type](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#view-tasks-by-type) chart.\n\nBut that's not all. The Total Time Chart also simplifies the top-down optimization flow, starting from the Value Streams Dashboard organization-level view to a drill-down into the performance of each project:\n\n\u003Ciframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/EA9Sbks27g4\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen>\u003C/iframe>\n\n\nFrom the Value Stream Analytics overview page, you can drill down from the Key metrics tiles into other GitLab analytics pages for deeper investigations. You can also go up to the Value Streams Dashboard, or investigate the [DORA metrics](/solutions/value-stream-management/dora/) that are also available in the new dashboard.\n\nIt's important to note that the chart data is limited to items completed within the selected date range. Also, there could be points in time with no [\"stage event\"](https://docs.gitlab.com/ee/user/group/value_stream_analytics/#value-stream-stage-events) actions. In these cases, the chart will display a dashed line to represent the missing data. These gaps can add contextual information about the workstream, and usually do not represent interruptions in the data. 
When there is \"no data\" for a specific stage, the stage line will be flat.\n\nTo learn more check out the [Value Stream Analytics documentation](https://docs.gitlab.com/ee/user/group/value_stream_analytics/).\n\nWith the Value Stream Analytics Total Time Chart, you get immediate insights about the time spent in each stage over time to determine if progress is being made. Try it out today and see the difference it can make in your workstream!\n",[843,707,823,9,732],{"slug":1944,"featured":6,"template":688},"value-stream-total-time-chart","content:en-us:blog:value-stream-total-time-chart.yml","Value Stream Total Time Chart","en-us/blog/value-stream-total-time-chart.yml","en-us/blog/value-stream-total-time-chart",{"_path":1950,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1951,"content":1957,"config":1964,"_id":1966,"_type":13,"title":1967,"_source":15,"_file":1968,"_stem":1969,"_extension":18},"/en-us/blog/vestiaire-collective-on-moving-to-a-devsecops-platform",{"title":1952,"description":1953,"ogTitle":1952,"ogDescription":1953,"noIndex":6,"ogImage":1954,"ogUrl":1955,"ogSiteName":672,"ogType":673,"canonicalUrls":1955,"schema":1956},"Vestiaire Collective's DevSecOps migration: Wins and insights","Support for container registries and integrations with existing tools were the top reasons for the ecommerce company's migration to GitLab.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749670278/Blog/Hero%20Images/fasttrack.jpg","https://about.gitlab.com/blog/vestiaire-collective-on-moving-to-a-devsecops-platform","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Vestiaire Collective VP shares wins, insights, and what's next with DevSecOps migration\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Chandler Gibbons\"}],\n        \"datePublished\": \"2023-01-05\",\n      }",{"title":1958,"description":1953,"authors":1959,"heroImage":1954,"date":1961,"body":1962,"category":1227,"tags":1963},"Vestiaire Collective VP shares wins, insights, and what's next with DevSecOps migration",[1960],"Chandler Gibbons","2023-01-05","\n[Vestiaire Collective](https://us.vestiairecollective.com/), an online marketplace for second-hand clothing and luxury items, needed a faster and easier-to-use solution for code reviews and running pipelines. In 2018, the company migrated its codebase to GitLab for its speed and flexibility in setting up custom workflows and pipelines for releases. Since making the move, Vestiaire Collective has taken advantage of GitLab’s integrations with other tools — including [Jenkins for CI/CD](https://docs.gitlab.com/ee/integration/jenkins.html), [Jira](https://docs.gitlab.com/ee/integration/jira/) for issue management, and Nexus artifact storage — to improve productivity and simplify complex toolchains. We talked to Sardorbek Pulatov, vice president of engineering at Vestiaire Collective, about what his team has been able to achieve with the GitLab DevSecOps Platform and the lessons learned along the way.\n\n**What were the challenges that led Vestiaire Collective to explore GitLab?**\n\nWhen Vestiaire Collective started with GitLab back in 2018, we wanted to have a fast and in-house version control system with features such as running pipelines. One of the biggest chunks of our code base, the monolith, was on Subversion. We migrated to GitLab for speed and also the better maintainability, and code reviews being much easier. 
GitLab has also enabled us to set up workflows and pipelines for our releases. And recently we also created our own tool for releases because we have a custom workflow in Jira.\n\nNow we have not just engineers in GitLab, but also data engineers and data scientists. So, for example, data scientists manage their releases through their repositories in GitLab. They’re actually quite advanced in using GitLab, the data scientist teams. So they use everything new released by GitLab.\n\n**Since moving to a single platform for DevSecOps, what are the biggest benefits you’ve noticed? How has GitLab helped Vestiaire Collective simplify complicated toolchains?**\n\nWhen GitLab released support for container registries and npm, it was such a relief for us because we were using Amazon Elastic Container Registry (ECR) and it was slow because it was in a different location — we deploy in Ireland but our team is spread across Europe and the United States. We also tried to use our own setup with Nexus and support it ourselves, meaning if there was a vulnerability we would need to update it and maintain it separately. Even if that’s only required once every six months, it still takes time. You still need to plan the upgrade. But with GitLab, our problem was solved. Now developers have [a registry for containers inside GitLab](https://docs.gitlab.com/ee/user/packages/container_registry/) so they can easily push new releases of their services.\n\nThe fact that GitLab integrates with the other tools we are using has also been a huge benefit. We use Jira for project management, and thanks to GitLab’s Jira integration, whenever a developer pushes a commit in GitLab it’s fully visible in Jira. And now, with our custom integration, the releases are also synced, so when you create a release in GitLab, it creates a release with the same ticket in Jira.\n\nAs a next step, personally, I would love us to be able to migrate entirely into GitLab for project management, using GitLab [issues](https://docs.gitlab.com/ee/user/project/issues/index.html) and [epics](https://docs.gitlab.com/ee/user/group/epics/). We’re not there yet, but GitLab provides almost all the functionality needed for developers. Tracking everything in GitLab would make it much easier to reference the issues in code reviews. Now, when you create a ticket in Jira, you need to create a branch in GitLab with the Jira ticket number, and then, when you push a commit, you also need to remember the ticket number. But once everything is in GitLab, we’ll be able to just push a commit to a merge request. GitLab already gives us so much transparency into what we are doing. That would be even greater if everyone was using GitLab issues and epics.\n\n**What has the response from your team been like?**\n\nThere have been no complaints about stability or performance, and the performance is improving release by release! GitLab became very fast with [version 15](/releases/2022/05/22/gitlab-15-0-released/) — I can feel and see the performance boost. People are happy. People have been quiet, and when engineers are not complaining, that means that the tool is quite good. \n\n**For companies that are just getting started with GitLab, what advice would you give them on where to start?**\n\nI’d recommend starting with smaller projects, setting up all the steps needed for your pipeline, and trying to use features of GitLab such as issues and epics. 
In our case, we started with a larger project from our Product Information Management service team — the project’s repository had three services and we needed to run different pipelines for different changes. And even in our case, GitLab was quite flexible. We could say, “Okay, if a commit message has this specific word, then run these steps. If it has this word, run these other steps.”\n\nWhat we learned from that experience was that first it’s valuable to understand what you need to run as a pipeline. What comes to mind first is tests and probably deployment into an environment. Then we need to monitor the performance and see if we need to pass our caches in between the pipelines to speed up the deployment, or in the case of Node.js, do not download [npm packages](https://docs.gitlab.com/ee/user/packages/npm_registry/) in every change or merge request or branch. Just cache it once in the first run. Then you can optimize step by step. So that’s what I mean by starting small.\n\n**What are you most looking forward to doing with GitLab in the future?**\n\nI love this question. First, I would like to point out that GitLab surprises me with each release. Personally, I am looking forward to using more automation tools for QA engineers, as well as auto pipelines and integrations with the latest automation frameworks.\n\nWe recently moved away from Sentry for error tracking, so I’m also interested in exploring doing [error tracking in GitLab](https://docs.gitlab.com/ee/operations/error_tracking.html). And, I’m interested in seeing how we might be able to use [feature flags in GitLab](https://docs.gitlab.com/ee/operations/feature_flags.html). We’re currently using LaunchDarkly for A/B testing, but if GitLab can even match half of that functionality, it would be great to bring everything together into one platform.\n\nFinally, we’re also looking into how we can make our GitLab implementation even better and more stable, so we want to deploy it into [a Kubernetes cluster](https://docs.gitlab.com/ee/user/clusters/agent/). Currently, it’s just deployed into EC2s, so that would be our next big step for GitLab.\n\nPhoto by [Mathew Schwartz](https://unsplash.com/@cadop?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com)\n",[758,1187,9,683,684],{"slug":1965,"featured":6,"template":688},"vestiaire-collective-on-moving-to-a-devsecops-platform","content:en-us:blog:vestiaire-collective-on-moving-to-a-devsecops-platform.yml","Vestiaire Collective On Moving To A Devsecops Platform","en-us/blog/vestiaire-collective-on-moving-to-a-devsecops-platform.yml","en-us/blog/vestiaire-collective-on-moving-to-a-devsecops-platform",{"_path":1971,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1972,"content":1978,"config":1984,"_id":1986,"_type":13,"title":1987,"_source":15,"_file":1988,"_stem":1989,"_extension":18},"/en-us/blog/why-all-organizations-need-prometheus",{"title":1973,"description":1974,"ogTitle":1973,"ogDescription":1974,"noIndex":6,"ogImage":1975,"ogUrl":1976,"ogSiteName":672,"ogType":673,"canonicalUrls":1976,"schema":1977},"Why Prometheus is for everyone","You think you don't need Prometheus – I'm here to tell you why you're wrong. 
Learn why GitLab uses Prometheus, and why your organization should be using it too!","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749678778/Blog/Hero%20Images/monitoring-cover.png","https://about.gitlab.com/blog/why-all-organizations-need-prometheus","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why Prometheus is for everyone\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Lee Matos\"}],\n        \"datePublished\": \"2018-09-27\",\n      }",{"title":1973,"description":1974,"authors":1979,"heroImage":1975,"date":1981,"body":1982,"category":681,"tags":1983},[1980],"Lee Matos","2018-09-27","\nIt's no secret that here at GitLab, we hitched our wagon to [Prometheus](https://docs.gitlab.com/ee/administration/monitoring/prometheus/index.html#doc-nav) long ago. We've been\n[shipping it with GitLab since 8.16](/releases/2017/01/22/gitlab-8-16-released/). Having said that,\neven within GitLab we weren't all using Prometheus. The Support Engineering team was\nvery much in the camp of \"We don't need this to troubleshoot customer problems.\" We were wrong;\nwe needed Prometheus all along, and here's why your organization should be using it too.\n\n## What is Prometheus?\n\nFor a short answer, Prometheus is software that stores event data in real time. But more specifically…\n\nPrometheus is a powerful and free open-source software monitoring service that records real-time metrics and provides real-time alerts. It’s built with an HTTP pull model. Prometheus collects performance metrics, which you can view through an external dashboard tool (such as Grafana) or by connecting to Prometheus directly. \n\nSoundCloud was the original developer of Prometheus, but nowadays it is maintained by the Cloud Native Computing Foundation (CNCF). The cloud-native architecture of Prometheus has made it extremely popular as part of a modern technology stack. \n\n## Prometheus is great, so why isn't everyone using it already?\n\nI think GitLab customers fall into a few categories: You have the customer who wants to use GitLab\nbut can't or doesn't want to manage servers. They'll use [GitLab.com](/pricing/)! By making that choice they can\nleverage the hard work of our Production team and reap the benefits of what Prometheus has to offer.\n\nThen you have the customer who is [running their own simple GitLab deployment](/pricing/#self-managed), but they may\nnot know or appreciate the value of Prometheus metrics. The Support Engineering team was\nlike this too! We thought, \"We can use traditional tools. Just knowing about where logging is,\nknowing about the system, is enough to actually solve the problems that we see. Just having\nexperience is enough.\" Not so.\n\nThen you have large, enterprise customers who are deploying GitLab clusters with multiple dozens of\nservers and a lot of moving parts. For them, Prometheus really shines because the complexity\nballoons, and once you move GitLab from a single server to three, or four, or 20, being able\nto see all of the metrics in one view makes a huge difference in time to resolving critical infrastructure issues.\n\n## How we saw the light about Prometheus\n\nA large GitLab customer was experiencing a really strange, catastrophic failure scenario, and\nthe problem was proving elusive to the support team. 
Even after days of troubleshooting, we couldn't\nfind what we were looking for, so we called in [Jacob](/company/team/#jacobvosmaer) from our\n[Gitaly](/blog/the-road-to-gitaly-1-0/) team because it looked like Gitaly was at the\ncore of the problem. We had been using Gitaly on GitLab.com for about six months at that point\nand he had never seen it behave this way before. It looked like Gitaly was accessing Git data,\nbut just _extremely slow_, and it would spread across the cluster one server at a time. Jacob\nand I speculated and made some Gitaly dashboards, and while that was a good moment of cross-team\ncollaboration, he was stumped.\n\nMost of the time when we're debugging GitLab, it's easy to pinpoint the root of the problem.\nBut in this case, it was a catastrophic failure across the entire cluster that was a ticking time bomb.\nWhen we'd see the indicators, we'd effectively have 15-35 minutes before the entire fleet was down.\nThis customer actually had Prometheus on their roadmap but hadn't deployed it yet, so when\nthe failure happened, it went to the top of our list of things to deploy:\n\n**Support**: We should focus on trying to understand why this host is affected.\n\n**Production**: If we get better observability with Prometheus we'll move faster.\n\n**Support**: I'm worried this is a distraction! We don't have much time.\n\n**Production**: Watch and learn. Watch and learn.\n\n_(Cue dramatic montage of hackers with GitLab stickers on their laptops feverishly typing under duress)_\n\nOnce Prometheus was in place, we called in the Production team. They run one of the largest\nGitLab instances: GitLab.com. We exported their dashboard and gave it to the customer, so\nwithin minutes they had a GitLab production-scale dashboard with all of the things that\nour production engineers use. Now, we could leverage the wealth of knowledge of our Prometheus\nexperts, as it's a familiar interface and they know exactly what they're looking at.\n\nWith that as a starting point, we started querying and slicing the data and building dashboards, which gave\nus a couple of different facets for viewing the data and coming to some conclusions.\n\"Okay, it looks like once a host becomes 'tainted,' all Git-level operations spike and _HALT_.\nNow we can finally ask the question, why?\" And then, when we asked that, we saw that it was\na problem with Amazon's EFS file system. We had hit some upper boundary of EFS access and,\nhaving identified it, we were able to fix it by moving those specific files out of EFS. After we\nmade that change, it was easy to use Prometheus and Grafana to verify that the state was sound\nand everything was working as expected afterwards, without even lifting a finger. We just looked\nat the dashboard in place. So while the customer had intended to deploy Prometheus later this\nyear, now, in this emergency situation, Prometheus definitely saved the day and is a huge part\nof keeping their GitLab infrastructure healthy. Without it we wouldn't be nearly as confident\nor comfortable in our solution.\n\n## Prometheus has opened up a whole world of possibilities\n\nWe have another large client that's on an older version of GitLab without Prometheus. We're\nworking to debug things there and, while we're able to do it, it's slower going. It requires a lot\nmore manual effort to coalesce the data and get it in a form we can use. It often takes about\n35-40 minutes to get the data, slicing with grep, AWK, and friends, and at least one man page\nto look up syntax. 
Whereas, with Prometheus and Grafana, we'd be able to just access and view\nthe data, query it, and act on it within minutes. We already have a lot of [built-in monitoring capabilities](https://docs.gitlab.com/ee/administration/monitoring/). GitLab is a complex\nsystem built of various open source sub-systems, and we're monitoring all of them with Prometheus.\nYou can too.\n\n### Everyone should be using our GitLab.com dashboard\n\nAs I said earlier, in our intense, catastrophic scenario, we gave the customer our GitLab.com\ndashboard. Any customer can use this dashboard as a template! You can literally go to [dashboards.gitlab.com](https://dashboards.gitlab.com), click \"export,\" get the dashboard, run your instance, then click \"import.\" It will show up, and\nyou just need to tweak the name so that it's not defaulting to our GitLab Production cluster.\nThen Prometheus just fills in the data.\n\n\u003Ciframe src=\"https://giphy.com/embed/12NUbkX6p4xOO4\" width=\"480\" height=\"440\" frameBorder=\"0\" class=\"giphy-embed\" allowFullScreen>\u003C/iframe>\n\nWe're trying to standardize around using the dashboards here, so that while there are differences\nand nuances in the deployments etc., we're speaking a common language, and have a common\nmeeting point for GitLab engineers across teams to monitor and talk GitLab performance.\n\n## Are you convinced about Prometheus yet?\n\nWe're now actively training our support team on Prometheus. And it's likely that other organizations\nhave the same thing happening – where another group could be impacting or helping,\nbut they're not collaborating, so they can't see where or how they can help one another. We've\nseen the light, and we want to make sure that everybody can make use of it.\n\nMany customers think they don't need Prometheus and are reluctant to use it because it adds\noverhead; you have to configure it and set it up, and it may require a bit of finessing. GitLab\nis trying to make that even easier, but right now when you're building a bespoke deployment,\nit requires a bit of time, and you may not think the time invested is worth it. And I'm here to say,\nit is, get it now! In fact, it's already there. You just need to turn it on! I'm advocating that all\nlarge customer deployments over 500 users have Prometheus running by 2019.
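\n\nTo give a sense of how little configuration this takes, here is a minimal, illustrative scrape config for a standalone Prometheus watching a single GitLab host. The hostname and port are placeholders, and the Prometheus bundled with Omnibus GitLab comes preconfigured, so treat this as a sketch rather than a recipe:\n\n```yml\n# prometheus.yml -- hypothetical minimal example, not a GitLab-provided config\nglobal:\n  scrape_interval: 15s # how often Prometheus pulls metrics\n\nscrape_configs:\n  - job_name: gitlab-host # host-level metrics via node_exporter\n    static_configs:\n      - targets: [\"gitlab.example.com:9100\"] # node_exporter's default port\n```\n\n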
Turn it on and\nthen we'll all reap the rewards.\n",[823,754,9],{"slug":1985,"featured":6,"template":688},"why-all-organizations-need-prometheus","content:en-us:blog:why-all-organizations-need-prometheus.yml","Why All Organizations Need Prometheus","en-us/blog/why-all-organizations-need-prometheus.yml","en-us/blog/why-all-organizations-need-prometheus",{"_path":1991,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":1992,"content":1998,"config":2005,"_id":2007,"_type":13,"title":2008,"_source":15,"_file":2009,"_stem":2010,"_extension":18},"/en-us/blog/why-the-market-is-moving-to-a-platform-approach-to-devsecops",{"title":1993,"description":1994,"ogTitle":1993,"ogDescription":1994,"noIndex":6,"ogImage":1995,"ogUrl":1996,"ogSiteName":672,"ogType":673,"canonicalUrls":1996,"schema":1997},"Why the market is moving to a platform approach to DevSecOps","A single DevOps platform improves ROI, the developer experience, and customer retention and satisfaction.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749667886/Blog/Hero%20Images/cobolshortage.jpg","https://about.gitlab.com/blog/why-the-market-is-moving-to-a-platform-approach-to-devsecops","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why the market is moving to a platform approach to DevSecOps\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"GitLab\"}],\n        \"datePublished\": \"2022-10-24\",\n      }",{"title":1993,"description":1994,"authors":1999,"heroImage":1995,"date":2001,"body":2002,"category":730,"tags":2003},[2000],"GitLab","2022-10-24","The market is moving to a platform approach to [DevSecOps](/topics/devsecops/). What had previously been a process that let different engineering teams adopt their own tools for different stages of the software development lifecycle – what we call “DIY DevOps” – is being replaced by a method that leverages a single application.\n\nWhy is this happening? First, IT managers are coming to grips with the inefficiencies and cost of toolchain sprawl. Second, executives are relying on digital transformation to solve significant business-level problems: improving developer onboarding and productivity, building high-performing teams, securing the software supply chain, and creating a secure on-ramp to the public cloud. Finally, there’s the impact of [the potential recession](https://www.worldbank.org/en/news/press-release/2022/09/15/risk-of-global-recession-in-2023-rises-amid-simultaneous-rate-hikes), which has accelerated the above trends.\n\nWe recently commissioned a [Forrester Consulting “Total Economic Impact™ of GitLab’s Ultimate Plan” study](https://page.gitlab.com/resources-study-forrester-tei-gitlab-ultimate.html) to better understand how companies save on costs and achieve business and technology goals with GitLab. We focused on our Ultimate tier, which is the fastest growing part of the business. We believe the results align with the business requirements needed to endure economic headwinds and position companies for success: strong return on technology investment, cost savings through technical tool consolidation, a faster pace of application releases to acquire and retain customers, greater development and delivery efficiency, increased and simplified security, and a rapid payback period. \n\nGitLab’s DevOps platform enables source code management, continuous integration/continuous delivery, advanced security capabilities, and more in a single application. 
The Forrester study found that this combination led to:\n\n* Three-year ROI of 427%\n* 12x increase in the number of annual releases for revenue-generating applications\n* 87% improvement in development and delivery efficiency time\n* Less than six-month payback period\n\n## Understanding DevOps pain points\n\nTo realize the benefits of a single DevOps platform, organizations have to assess their pain points. Here are some common development lifecycle obstacles that affect organizations of all sizes:\n\n* Complex toolchains and processes\n* Inefficient development environments\n* Lack of security skills\n* Rushed development cycles\n* No single source of truth or single code repository\n* Poor software testing practices\n\nAll of these pain points can impede an organization’s ability to manage through a recession and recovery. \n\n## The benefits of a DevOps platform\n\nThe Forrester study found that GitLab Ultimate provided a composite organization, modeled on interviewed customers, with 10 key quantified benefits over a three-year period. While each benefit on its own could have a positive impact on a business’s ability to stay steady and even thrive during difficult economic times, together they are a powerhouse that can eliminate many pain points.\n\nHere are five of those benefits of the GitLab Ultimate Plan:\n\n### Vulnerability management\n\nAs GitLab’s 2022 Global DevSecOps Survey found, [security is top of mind](/blog/gitlabs-2022-global-devsecops-survey-security-is-the-top-concern-investment/) for all DevOps organizations. Yet security at scale can be challenging, especially finding and hiring professionals with the right skills.\n\nA benefit of GitLab Ultimate, according to the Forrester study, is greater efficiency in managing vulnerabilities. The DevOps platform [integrates and automates vulnerability management](/direction/govern/threat_insights/vulnerability_management/) within the development lifecycle. Issues can be identified, logged, triaged, tracked, and remediated – all in the same DevOps application. Developers can address vulnerabilities in real time, avoiding release delays or software defects and bugs. According to Forrester, the composite organization realized savings of “hours a week because developers have access to better context about the vulnerabilities. This in turn means less back and forth between development and QA/security on an issue.”\n\n### Less homegrown tool development/open source solution management\n\nDevOps teams often spend a considerable amount of time creating tools they need from scratch or finding and managing open source options. GitLab reduces [toolchain complexity (a.k.a. debt)](/blog/battling-toolchain-technical-debt/) by building into the platform the tools and features developers need, enabling them to manage their environment as a single application. GitLab Ultimate enabled the Forrester study’s composite organization to shift “from manually intensive tasks requiring the full attention of the developer, security, and operations teams to an environment where they now spend no more than a few hours per day per person on the same tasks.”\n\n### Efficient development\n\nA highly efficient development process improves the developer experience, which in turn improves retention. GitLab Ultimate enabled the composite organization to develop code faster, deliver higher quality code, enable better collaboration, and improve the ability to monitor applications, according to the Forrester study. 
Other advantages include: more streamlined processes, better efficiency among developers and non-technical teammates, and improved visibility and collaboration across the SDLC.\n\n### Better code quality\n\nPoor code quality directly affects a company’s ability to attract and retain customers. GitLab enabled the composite organization to have “a single application that streamlines processes to ensure code is tested, scanned, and verified before it is released,” according to the Forrester study. The result is high-quality code (with reduced defects and bugs) that meets security standards.\n\n### More releases, faster\n\nOrganizations want to be able to address customer needs for newer applications, updates, and enhanced feature sets in a timely fashion. With GitLab, the composite organization can “increase the velocity of updates and releases, allowing it to meet customers’ rising digital demands.”\n\nThe DevOps platform brought about the following unquantified benefits for the composite organization, according to the Forrester study: more satisfied employees because they are more productive and collaborative; more satisfied customers because of a smoother project workflow, improved release quality, and a faster release frequency; and improved market innovation and competitiveness due to a faster development lifecycle and time to market.\n\nWhile DevOps platform benefits are applicable to any economic environment, they are even more so in this time of economic uncertainty. GitLab enables organizations to extract the most out of their DevOps environment and achieve faster, higher quality, and more secure development and release cycles.\n\n> Download the full [Forrester Consulting “Total Economic Impact of GitLab’s Ultimate Plan” study](https://page.gitlab.com/resources-study-forrester-tei-gitlab-ultimate.html) for:\n>\n> * Additional benefits of the GitLab Ultimate Plan\n> * Testimonials from GitLab customers Forrester interviewed\n> * Assumptions and risks to calculate ROI",[707,2004,9],"research",{"slug":2006,"featured":6,"template":688},"why-the-market-is-moving-to-a-platform-approach-to-devsecops","content:en-us:blog:why-the-market-is-moving-to-a-platform-approach-to-devsecops.yml","Why The Market Is Moving To A Platform Approach To Devsecops","en-us/blog/why-the-market-is-moving-to-a-platform-approach-to-devsecops.yml","en-us/blog/why-the-market-is-moving-to-a-platform-approach-to-devsecops",{"_path":2012,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":2013,"content":2019,"config":2024,"_id":2026,"_type":13,"title":2027,"_source":15,"_file":2028,"_stem":2029,"_extension":18},"/en-us/blog/why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale",{"title":2014,"description":2015,"ogTitle":2014,"ogDescription":2015,"noIndex":6,"ogImage":2016,"ogUrl":2017,"ogSiteName":672,"ogType":673,"canonicalUrls":2017,"schema":2018},"The next step in performance testing? The GitLab Environment Toolkit","Learn how we're building a new toolkit to help with performance testing and deploying GitLab at scale.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749682030/Blog/Hero%20Images/gitlab_environment_toolkit_scale.jpg","https://about.gitlab.com/blog/why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"The next step in performance testing? 
The GitLab Environment Toolkit\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Grant Young\"}],\n        \"datePublished\": \"2021-06-15\",\n      }",{"title":2014,"description":2015,"authors":2020,"heroImage":2016,"date":2021,"body":2022,"category":681,"tags":2023},[1183],"2021-06-15","\n\nLast year I wrote about how the [Quality Engineering Enablement team](/handbook/engineering/quality/) was [building up the performance testing of GitLab](/blog/how-were-building-up-performance-testing-of-gitlab/) with the [GitLab Performance Tool (GPT)](https://gitlab.com/gitlab-org/quality/performance). Back then, the biggest challenge with performance testing wasn't so much the testing but rather setting up the right large-scale GitLab environments to test against.\n\nAs with any server application, deploying at scale is challenging. That's why we built another toolkit that automates the deployment of GitLab at scale: The [GitLab Environment Toolkit (GET)](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit).\n\n![GitLab Environment Toolkit logo](https://about.gitlab.com/images/blogimages/gitlab-environment-toolkit/gitlab_environment_toolkit_logo.png){: .center}\nGitLab Environment Toolkit logo\n{: .note.text-center}\n\nInternally called the \"Performance Environment Builder\" (PEB), GET grew alongside GPT as we continued to expand our performance testing efforts. Over time we built a toolkit that was quite capable in its own right of deploying GitLab at scale, which is why it started to gain attention internally from other teams and then even from some customers. Soon we realized we had built something worth sharing.\n\nThe Quality Engineering Enablement team has been working hard over the last few months to polish the toolkit for broader use, and we're happy to share that the first version, [GET v1.0.0](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/releases/v1.0.0), has been released!\n\nGET is a collection of well-known open source provisioning and configuration tools with a simple, focused purpose - to deploy [GitLab Omnibus](https://gitlab.com/gitlab-org/omnibus-gitlab) and [GitLab Helm Charts](https://docs.gitlab.com/charts/) at scale, as defined by our [Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures) and [Cloud Native Hybrid Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures/10k_users.html#cloud-native-hybrid-reference-architecture-with-helm-charts-alternative). Built with Terraform and Ansible, GET supports the provisioning and configuring of machines and other related infrastructure and contains the following features:\n\n - Support for deploying all GitLab Reference Architecture sizes dynamically, from 1,000 to 50,000 users\n - Support for deploying Cloud Native Hybrid Reference Architectures (GCP only at this time)\n - GCP, AWS, and Azure cloud provider support\n - Upgrades\n - Release and nightly Omnibus builds support\n - Advanced search with Elasticsearch\n - Geo support\n - Zero Downtime Upgrades support\n - Built-in load balancing via HAProxy and monitoring (Prometheus, Grafana) support\n\nWe're just getting started with GET, and [continue to add more support for features and different environment setups](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/boards?group_by=epic). 
Now that GET [v1.0.0](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/releases/v1.0.0) has been released, we're at a good place for customers to start trialing and evaluating GET. We do ask that you take into consideration the continuing expansion of capabilities, as well as the limitations of the current version.\n\nRead on to learn about the philosophy of GET and how it works.\n\n## The design principles of GET\n\nOur team has past experience with provisioning and configuration tools, so we've learned what does and does not work, which is why we try to stick to the following goals:\n\n- GET is [boring](https://handbook.gitlab.com/handbook/values/#boring-solutions): The word boring may look funny here, but it's actually a [GitLab value](https://handbook.gitlab.com/handbook/values/). A boring solution essentially means to keep it simple. Provisioning and configuration solutions can get complicated **fast** with many common pitfalls, such as trying to support complex setups that come with a heavy maintenance cost. From the very beginning we've tried to avoid this, so GET essentially uses a standard Terraform and Ansible config that doesn't try to do anything fancy or complicated.\n- GET is *not* a replacement for [GitLab Omnibus](https://gitlab.com/gitlab-org/omnibus-gitlab) or the [Helm Charts](https://docs.gitlab.com/charts/): Truly, some of the greatest \"magic\" in setting up GitLab is how much easier Omnibus and the Helm Charts have made it. Thanks to the incredible work by our Distribution teams, both of these install methods do a lot under the hood, and GET is not trying to replace these. In the same [boring](https://handbook.gitlab.com/handbook/values/#boring-solutions) vein, GET's purpose is simply to set up GitLab environments at scale by installing Omnibus or Helm in the right places (along with any other needed infrastructure to support).\n- GET is one for all and designed to work for all our recommended [GitLab Reference Architectures](https://docs.gitlab.com/ee/administration/reference_architectures/). Everything we do with GET has to be considered against this goal. It means we may not be able to support niche or overly complex setups, as this would lead to complex code and heavy maintenance costs. We do aim to support recommended customizations where appropriate.\n\nNext, we look at how GET works at a high level, starting with provisioning with Terraform.\n\n## Provisioning the environment with Terraform\n\nThe first step to building an environment is to provision the machines and/or Kubernetes clusters that run GitLab. We handle this process with the well-known provisioning tool [Terraform](https://www.terraform.io/).\n\nTo that end, we've created multiple [Terraform modules](https://www.terraform.io/docs/language/modules/develop/index.html) in GET for each of the big three cloud providers (GCP, AWS, and Azure) that provision machines for you, according to the appropriate [reference architectures](https://docs.gitlab.com/ee/administration/reference_architectures/), along with the necessary supporting infrastructure, such as firewalls, load balancers, etc. We designed these modules to be as simple as possible and only require minimal configuration.\n\nFor more information on the entire Terraform configuration, [check out our docs](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/blob/master/docs/environment_provision.md). An example of one of the main config files is `environment.tf`, which defines how each component's nodes should be set up. 
Below is an example of how it is configured with GCP for a [10k reference architecture](https://docs.gitlab.com/ee/administration/reference_architectures/10k_users.html) environment:\n\n```tf\nmodule \"gitlab_ref_arch_gcp\" {\n  source = \"../../modules/gitlab_ref_arch_gcp\"\n\n  prefix = var.prefix\n  project = var.project\n\n  object_storage_buckets = [\"artifacts\", \"backups\", \"dependency-proxy\", \"lfs\", \"mr-diffs\", \"packages\", \"terraform-state\", \"uploads\"]\n\n  # 10k\n  consul_node_count = 3\n  consul_machine_type = \"n1-highcpu-2\"\n\n  elastic_node_count = 3\n  elastic_machine_type = \"n1-highcpu-16\"\n\n  gitaly_node_count = 3\n  gitaly_machine_type = \"n1-standard-16\"\n\n  praefect_node_count = 3\n  praefect_machine_type = \"n1-highcpu-2\"\n\n  praefect_postgres_node_count = 1\n  praefect_postgres_machine_type = \"n1-highcpu-2\"\n\n  gitlab_nfs_node_count = 1\n  gitlab_nfs_machine_type = \"n1-highcpu-4\"\n\n  gitlab_rails_node_count = 3\n  gitlab_rails_machine_type = \"n1-highcpu-32\"\n\n  haproxy_external_node_count = 1\n  haproxy_external_machine_type = \"n1-highcpu-2\"\n  haproxy_external_external_ips = [var.external_ip]\n  haproxy_internal_node_count = 1\n  haproxy_internal_machine_type = \"n1-highcpu-2\"\n\n  monitor_node_count = 1\n  monitor_machine_type = \"n1-highcpu-4\"\n\n  pgbouncer_node_count = 3\n  pgbouncer_machine_type = \"n1-highcpu-2\"\n\n  postgres_node_count = 3\n  postgres_machine_type = \"n1-standard-4\"\n\n  redis_cache_node_count = 3\n  redis_cache_machine_type = \"n1-standard-4\"\n  redis_sentinel_cache_node_count = 3\n  redis_sentinel_cache_machine_type = \"n1-standard-1\"\n  redis_persistent_node_count = 3\n  redis_persistent_machine_type = \"n1-standard-4\"\n  redis_sentinel_persistent_node_count = 3\n  redis_sentinel_persistent_machine_type = \"n1-standard-1\"\n\n  sidekiq_node_count = 4\n  sidekiq_machine_type = \"n1-standard-4\"\n}\n\noutput \"gitlab_ref_arch_gcp\" {\n  value = module.gitlab_ref_arch_gcp\n}\n```\n\nWith this environment config and [two other small config files in place](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/blob/master/docs/environment_provision.md#2-setup-the-environments-config), Terraform can be run normally and work its magic. Below is a snippet of the output you'll see with GCP:\n\n```\n[...]\n\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[2]: Creating...\nmodule.gitlab_ref_arch_gcp.module.pgbouncer.google_compute_instance.gitlab[2]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.pgbouncer.google_compute_instance.gitlab[0]: Still creating... 
[10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.consul.google_compute_instance.gitlab[1]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[1]: Creating...\nmodule.gitlab_ref_arch_gcp.module.gitlab_nfs.google_compute_instance.gitlab[0]: Creation complete after 25s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[1]: Creating...\nmodule.gitlab_ref_arch_gcp.module.gitaly.google_compute_instance.gitlab[1]: Creation complete after 14s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[2]: Creating...\nmodule.gitlab_ref_arch_gcp.module.gitaly.google_compute_instance.gitlab[0]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[0]: Creating...\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[0]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.pgbouncer.google_compute_instance.gitlab[1]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.pgbouncer.google_compute_instance.gitlab[2]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.pgbouncer.google_compute_instance.gitlab[0]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[0]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.gitaly.google_compute_instance.gitlab[2]: Still creating... [20s elapsed]\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[2]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[1]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[1]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[2]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[0]: Still creating... [10s elapsed]\nmodule.gitlab_ref_arch_gcp.module.gitaly.google_compute_instance.gitlab[2]: Creation complete after 25s\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[2]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_sentinel_cache.google_compute_instance.gitlab[1]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[1]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[0]: Creation complete after 15s\nmodule.gitlab_ref_arch_gcp.module.redis_persistent.google_compute_instance.gitlab[2]: Creation complete after 15s\nReleasing state lock. This may take a few moments...\n\nApply complete! Resources: 90 added, 0 changed, 0 destroyed.\n```\n\nOnce it's done, you should have a full set of machines for GitLab that can be configured with Ansible, which is what we'll look at next.\n\n## How to configure the environment with Ansible\n\nThe next step for setting up the environment is configuring [Ansible](https://www.ansible.com/). 
In a nutshell, this tool connects to each machine via SSH and runs tasks to configure GitLab.\n\nAs with Terraform, [we've created multiple roles](https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html) and [Playbooks](https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html) in GET that are designed to configure each component on the intended machine. Through Terraform, we apply labels to each machine; Ansible then tracks these using its [dynamic inventory](https://docs.ansible.com/ansible/latest/user_guide/intro_dynamic_inventory.html) to determine the purpose of each machine.\n\nA detailed breakdown of the configuration process is available in the [GET for Ansible docs](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/blob/master/docs/environment_configure.md). Like we did before with Terraform, we'll highlight one of the main config files here. The file is `vars.yml`, an inventory variable file for your environment that contains the various settings Ansible needs to perform the setup, along with key GitLab config:\n\n```yml\nall:\n  vars:\n    # Ansible Settings\n    ansible_user: \"\u003Cssh_username>\"\n    ansible_ssh_private_key_file: \"\u003Cprivate_ssh_key_path>\"\n\n    # Cloud Settings\n    cloud_provider: \"gcp\"\n    gcp_project: \"\u003Cgcp_project_id>\"\n    gcp_service_account_host_file: \"\u003Cgcp_service_account_host_file_path>\"\n\n    # General Settings\n    prefix: \"\u003Cenvironment_prefix>\"\n    external_url: \"\u003Cexternal_url>\"\n    gitlab_license_file: \"\u003Cgitlab_license_file_path>\"\n\n    # Object Storage Settings\n    gitlab_object_storage_artifacts_bucket: \"{{ prefix }}-artifacts\"\n    gitlab_object_storage_backups_bucket: \"{{ prefix }}-backups\"\n    gitlab_object_storage_dependency_proxy_bucket: \"{{ prefix }}-dependency-proxy\"\n    gitlab_object_storage_external_diffs_bucket: \"{{ prefix }}-mr-diffs\"\n    gitlab_object_storage_lfs_bucket: \"{{ prefix }}-lfs\"\n    gitlab_object_storage_packages_bucket: \"{{ prefix }}-packages\"\n    gitlab_object_storage_terraform_state_bucket: \"{{ prefix }}-terraform-state\"\n    gitlab_object_storage_uploads_bucket: \"{{ prefix }}-uploads\"\n\n    # Passwords / Secrets - Can also be set as Environment Variables via ansible.builtin.env\n    gitlab_root_password: \"\u003Cgitlab_root_password>\"\n    grafana_password: \"\u003Cgrafana_password>\"\n    postgres_password: \"\u003Cpostgres_password>\"\n    consul_database_password: \"\u003Cconsul_database_password>\"\n    gitaly_token: \"\u003Cgitaly_token>\"\n    pgbouncer_password: \"\u003Cpgbouncer_password>\"\n    redis_password: \"\u003Credis_password>\"\n    praefect_external_token: \"\u003Cpraefect_external_token>\"\n    praefect_internal_token: \"\u003Cpraefect_internal_token>\"\n    praefect_postgres_password: \"\u003Cpraefect_postgres_password>\"\n```\n\nWith the variable file and the [environment inventory 
configured](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/blob/master/docs/environment_configure.md#2-setup-the-environments-dynamic-inventory) Ansible can run normally. Here is a snippet of the output you'll see with GCP:\n\n```\n[...]\n\nTASK [gitlab-rails : Update Postgres primary IP and Port] **********************\nok: [gitlab-qa-10k-gitlab-rails-1]\nTASK [gitlab-rails : Setup GitLab deploy node config file with DB Migrations] ***\nchanged: [gitlab-qa-10k-gitlab-rails-1]\nTASK [gitlab-rails : Reconfigure GitLab deploy node] ***************************\nchanged: [gitlab-qa-10k-gitlab-rails-1]\nTASK [gitlab-rails : Setup all GitLab Rails config files] **********************\nchanged: [gitlab-qa-10k-gitlab-rails-1]\nok: [gitlab-qa-10k-gitlab-rails-3]\nok: [gitlab-qa-10k-gitlab-rails-2]\nTASK [gitlab-rails : Reconfigure all GitLab Rails] *****************************\nchanged: [gitlab-qa-10k-gitlab-rails-1]\nchanged: [gitlab-qa-10k-gitlab-rails-3]\nchanged: [gitlab-qa-10k-gitlab-rails-2]\nTASK [gitlab-rails : Restart GitLab] *******************************************\nchanged: [gitlab-qa-10k-gitlab-rails-3]\nchanged: [gitlab-qa-10k-gitlab-rails-1]\nchanged: [gitlab-qa-10k-gitlab-rails-2]\n\n[...]\n\nPLAY RECAP *********************************************************************\ngitlab-qa-10k-consul-1     : ok=29   changed=17   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-consul-2     : ok=28   changed=16   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-consul-3     : ok=28   changed=16   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-elastic-1    : ok=41   changed=9    unreachable=0    failed=0    skipped=61   rescued=0    ignored=0\ngitlab-qa-10k-elastic-2    : ok=37   changed=7    unreachable=0    failed=0    skipped=62   rescued=0    ignored=0\ngitlab-qa-10k-elastic-3    : ok=37   changed=7    unreachable=0    failed=0    skipped=62   rescued=0    ignored=0\ngitlab-qa-10k-gitaly-1     : ok=27   changed=15   unreachable=0    failed=0    skipped=30   rescued=0    ignored=0\ngitlab-qa-10k-gitaly-2     : ok=26   changed=14   unreachable=0    failed=0    skipped=30   rescued=0    ignored=0\ngitlab-qa-10k-gitaly-3     : ok=26   changed=14   unreachable=0    failed=0    skipped=30   rescued=0    ignored=0\ngitlab-qa-10k-gitlab-nfs-1 : ok=28   changed=7    unreachable=0    failed=0    skipped=55   rescued=0    ignored=0\ngitlab-qa-10k-gitlab-rails-1 : ok=41   changed=21   unreachable=0    failed=0    skipped=32   rescued=0    ignored=0\ngitlab-qa-10k-gitlab-rails-2 : ok=35   changed=16   unreachable=0    failed=0    skipped=33   rescued=0    ignored=0\ngitlab-qa-10k-gitlab-rails-3 : ok=35   changed=16   unreachable=0    failed=0    skipped=33   rescued=0    ignored=0\ngitlab-qa-10k-haproxy-external-1 : ok=40   changed=8    unreachable=0    failed=0    skipped=62   rescued=0    ignored=0\ngitlab-qa-10k-haproxy-internal-1 : ok=39   changed=8    unreachable=0    failed=0    skipped=60   rescued=0    ignored=0\ngitlab-qa-10k-monitor-1    : ok=43   changed=19   unreachable=0    failed=0    skipped=35   rescued=0    ignored=0\ngitlab-qa-10k-pgbouncer-1  : ok=30   changed=17   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-pgbouncer-2  : ok=29   changed=16   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-pgbouncer-3  : ok=29   changed=16   unreachable=0    failed=0    skipped=28   rescued=0    
ignored=0\ngitlab-qa-10k-postgres-1   : ok=35   changed=16   unreachable=0    failed=0    skipped=36   rescued=0    ignored=0\ngitlab-qa-10k-postgres-2   : ok=34   changed=15   unreachable=0    failed=0    skipped=36   rescued=0    ignored=0\ngitlab-qa-10k-postgres-3   : ok=34   changed=15   unreachable=0    failed=0    skipped=36   rescued=0    ignored=0\ngitlab-qa-10k-praefect-1   : ok=29   changed=18   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-praefect-2   : ok=26   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-praefect-3   : ok=26   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-praefect-postgres-1 : ok=25   changed=14   unreachable=0    failed=0    skipped=29   rescued=0    ignored=0\ngitlab-qa-10k-redis-cache-1 : ok=26   changed=15   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-cache-2 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-cache-3 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-persistent-1 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-persistent-2 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-persistent-3 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-cache-1 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-cache-2 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-cache-3 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-persistent-1 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-persistent-2 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-redis-sentinel-persistent-3 : ok=25   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-sidekiq-1    : ok=28   changed=15   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-sidekiq-2    : ok=27   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-sidekiq-3    : ok=27   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\ngitlab-qa-10k-sidekiq-4    : ok=27   changed=14   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0\nlocalhost                  : ok=18   changed=3    unreachable=0    failed=0    skipped=38   rescued=0    ignored=0\n```\n\nOnce Ansible is done, you should have a fully running GitLab environment at scale!\n\n## What's next?\n\nWe've got a bunch of things planned for GET so it can support more features when setting up GitLab, such as SSL support, [cloud native hybrid architectures](/blog/cloud-native-architectures-made-easy/) on other cloud providers, object storage customization, and much more. 
We know deploying production-ready server applications is hard and has many potential requirements depending on the customer, and we hope to eventually support all recommended setups.\n\nCheck out the [GET development board](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/boards?group_by=epic) and our [issue list](https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit/-/issues) to see what is in progress. Share feedback and suggestions by adding to our issue lists; we're keen to hear what's important to customers.\n\n[Cover image](https://unsplash.com/photos/icdVDptHxpM) by [Jean Vella](https://unsplash.com/@jean_vella?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText).\n{: .note}\n",[754,9],{"slug":2025,"featured":6,"template":688},"why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale","content:en-us:blog:why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale.yml","Why We Are Building The Gitlab Environment Toolkit To Help Deploy Gitlab At Scale","en-us/blog/why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale.yml","en-us/blog/why-we-are-building-the-gitlab-environment-toolkit-to-help-deploy-gitlab-at-scale",{"_path":2031,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":2032,"content":2038,"config":2043,"_id":2045,"_type":13,"title":2046,"_source":15,"_file":2047,"_stem":2048,"_extension":18},"/en-us/blog/why-we-created-the-gitlab-memory-team",{"title":2033,"description":2034,"ogTitle":2033,"ogDescription":2034,"noIndex":6,"ogImage":2035,"ogUrl":2036,"ogSiteName":672,"ogType":673,"canonicalUrls":2036,"schema":2037},"Why we created a Memory team at GitLab","GitLab has a memory problem, so we created a specialized team to fix it.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749678549/Blog/Hero%20Images/memory_team_arie-wubben.jpg","https://about.gitlab.com/blog/why-we-created-the-gitlab-memory-team","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why we created a Memory team at GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Sara Kassabian\"}],\n        \"datePublished\": \"2019-09-13\",\n      }",{"title":2033,"description":2034,"authors":2039,"heroImage":2035,"date":2040,"body":2041,"category":681,"tags":2042},[1447],"2019-09-13","\nGitLab is an [all-in-one DevOps solution](/topics/devops/) with a growing feature set. But as more features are added to the application, more memory is required. Some users have reportedly elected to migrate to other tools because the memory footprint required to run a minimum GitLab instance was exorbitant:\n\n> “GitLab is great and I have used it for years but I recently switched to Gogs for self-hosted repositories because it is much faster, easier to set up, and walk in a park to maintain. It doesn't have all the features (bloat) that GitLab has but it can probably satisfy >95% of Git users.” – [Jnr on HackerNews](https://news.ycombinator.com/item?id=19227935)\n\n> “If GitLab grows any more features I'll be moving away simply to ensure confidence that I understand my own infrastructure in the limited time I have to maintain it. 
It's the weirdest kind of success problem to have, but the truth is if it wasn't such a pain to make the move, I'd have transitioned away from GitLab six months ago.” – [Sir_Substance on HackerNews](https://news.ycombinator.com/item?id=19230557)\n\n## Step 1: Establish priorities to solve our memory problem\n\nWe created the [GitLab Memory team](/handbook/engineering/development/enablement/data_stores/application_performance/) to tackle this performance challenge. The aim of the Memory team is to [reduce the minimum instance size for GitLab from 8GB](https://gitlab.com/gitlab-org/gitlab-ce/commit/0cd5d968038d6d64d95add0bbe3d63d8fcfdc23b) to 1GB of RAM. By reducing the memory required to run GitLab to 1GB, [our application can run anywhere](https://gitlab.com/groups/gitlab-org/-/epics/448), even on inexpensive commodity computers like an unaltered [Raspberry Pi 3 Model B+](https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/).\n\nThere is no quick fix for reducing GitLab’s memory footprint, but the team has started by investigating memory and performance bottlenecks, gathering data, and prioritizing activities for the next three to four months based on these results.\n\n“We know we have memory issues to address, but we need more data to determine the source, the impact and how to best approach the problem,” says [Craig Gomes](/company/team/#craiggomes), memory engineering manager.\n\n[Kamil Trzciński](/company/team/#ayufanpl), distinguished engineer and memory specialist at GitLab, says the top three priorities for the Memory team fall into three distinct buckets:\n\n1. [Moving over to Puma](https://gitlab.com/groups/gitlab-org/-/epics/954)\n1. [Performing the low-level exercise of optimizing endpoints](https://gitlab.com/groups/gitlab-org/-/epics/448)\n1. [Improving our development practices](https://gitlab.com/groups/gitlab-org/-/epics/1415)\n\n### Migrating from Unicorn to Puma\n\nPreliminary research shows that the bulk of GitLab’s memory usage comes from running web application processes on Unicorn.\n\n“Each Web application process (Unicorn) can take 500 MB of RAM, and it can handle a single request at a time. The more users and traffic we need to support, the more processes and hence RAM we need,” says [Stan Hu](/company/team/#stanhu), engineering fellow at GitLab.\n\nOne of the first projects the Memory team is tackling is testing to see if migrating from Unicorn to Puma will reduce GitLab’s memory footprint. Both Unicorn and Puma are HTTP servers for Rails applications, but unlike Unicorn, which forks a separate single-threaded process for each concurrent request, Puma is multi-threaded and does not require as much memory.
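\n\nAs a rough illustration of that threading model, here is what a minimal `config/puma.rb` might look like (the values are made up for illustration and are not GitLab's production settings):\n\n```ruby\n# config/puma.rb -- illustrative values only, not GitLab's settings.\n# A few forked workers, each serving many requests concurrently on\n# threads, instead of one single-threaded process per request.\nworkers 2\nthreads 1, 16\n\n# Load the application before forking so workers share memory\n# through copy-on-write.\npreload_app!\n```\n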
\nThe Memory team has successfully [configured Puma to work on dev.gitlab.com](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/82) to test its functionality and measure its memory reduction. The next big project in this domain is to [enable Puma on GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/954).\n\n### Dig deeper into what's causing memory issues for GitLab.com\n\nBefore GitLab is able to run on less memory, the team needs to fix the memory problems we already know about on GitLab.com. One of these problems is the memory killer for Sidekiq, the open source background job processor.\n\n\"If a Sidekiq job runs, takes too much memory, and then gets killed, jobs in the queue will be retried indefinitely,\" says Stan. The team is working to fix this, along with other priority one problems with memory usage in [project import](https://gitlab.com/gitlab-org/gitlab-ce/issues/59754) and [exports](https://gitlab.com/gitlab-org/gitlab-ce/issues/35389) in the 12.3 release.\n\n### Improve development practices around memory usage\n\nThe Memory team created a massive epic that aims to capture related [development work focusing on making improvements to internal dev practices around code complexity and memory usage](https://gitlab.com/groups/gitlab-org/-/epics/1415).\n\n\"The reason behind that is to enable everyone during development to understand the impact of introducing new changes to the application,\" says Kamil in the [epic](https://gitlab.com/groups/gitlab-org/-/epics/1415). Some of the projects they are working on for the 12.3 release include [testing more endpoints using typical GitLab user scenarios (e.g. commenting on a MR)](https://gitlab.com/gitlab-org/quality/performance/issues/34) and setting up a [performance monitoring solution across different environments](https://gitlab.com/gitlab-org/quality/performance/issues/37).\n\n## Step 2: Create a team to fix the memory problem\n\nWe need a specialized engineering team to assess the scope of the problem and identify solutions to reduce GitLab’s memory requirements.\n\n“Right now we have a very small team with two brand new team members,” says Craig. “The team is getting up to speed quickly and there is so much excitement about the potential of the team that more work keeps coming our way. It's a great challenge to have, and having more experienced engineers on the team will help us to achieve our goals.”\n\nThe current memory team is small but mighty. We have [Craig](/company/team/#craiggomes), the engineering manager, and three engineers on the permanent memory team: [Kamil](/company/team/#ayufanpl), [Qingyu Zhao](/company/team/#qzhaogitlab), and [Aleksei Lipniagov](/company/team/#alipniagov). The team works closely with senior product manager for distribution and memory, [Larissa Lane](/company/team/#ljlane). [We’re looking for more qualified people to join our team](https://handbook.gitlab.com/job-families/engineering/backend-engineer/#memory).\n\nThe Memory team is actively hiring engineers to help us enhance GitLab’s performance, but we have a high rejection rate because we require a specific, hard-to-find skill set. 
A [top priority for the Memory team is hiring at least one senior engineer in FY20-Q3](https://gitlab.com/gitlab-com/www-gitlab-com/issues/4885), which will allow us to take on a bigger workload as we move toward our goal of getting GitLab running on less than 1GB.\n\nFollow along with the Memory team by [subscribing to their channel on GitLab Unfiltered](https://www.youtube.com/playlist?list=PL05JrBw4t0Kq_5ZWIHYfbcAYjtXYcEZA3).\n\nCover photo by [Arie Wubben](https://unsplash.com/@condorito1953?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) on [Unsplash](https://unsplash.com/search/photos/airplane?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)\n{: .note}\n",[754,9,864],{"slug":2044,"featured":6,"template":688},"why-we-created-the-gitlab-memory-team","content:en-us:blog:why-we-created-the-gitlab-memory-team.yml","Why We Created The Gitlab Memory Team","en-us/blog/why-we-created-the-gitlab-memory-team.yml","en-us/blog/why-we-created-the-gitlab-memory-team",{"_path":2050,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":2051,"content":2057,"config":2065,"_id":2067,"_type":13,"title":2068,"_source":15,"_file":2069,"_stem":2070,"_extension":18},"/en-us/blog/why-we-spent-the-last-month-eliminating-postgresql-subtransactions",{"title":2052,"description":2053,"ogTitle":2052,"ogDescription":2053,"noIndex":6,"ogImage":2054,"ogUrl":2055,"ogSiteName":672,"ogType":673,"canonicalUrls":2055,"schema":2056},"Why we spent the last month eliminating PostgreSQL subtransactions","How a mysterious stall in database queries uncovered a performance limitation with PostgreSQL.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749669470/Blog/Hero%20Images/nessie.jpg","https://about.gitlab.com/blog/why-we-spent-the-last-month-eliminating-postgresql-subtransactions","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why we spent the last month eliminating PostgreSQL subtransactions\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Grzegorz Bizon\"},{\"@type\":\"Person\",\"name\":\"Stan Hu\"}],\n        \"datePublished\": \"2021-09-29\",\n      }",{"title":2052,"description":2053,"authors":2058,"heroImage":2054,"date":2061,"body":2062,"category":681,"tags":2063},[2059,2060],"Grzegorz Bizon","Stan Hu","2021-09-29","\nSince last June, we noticed the database on GitLab.com would\nmysteriously stall for minutes, which would lead to users seeing 500\nerrors during this time. Through a painstaking investigation over\nseveral weeks, we finally uncovered the cause of this: initiating a\nsubtransaction via the [`SAVEPOINT` SQL query](https://www.postgresql.org/docs/current/sql-savepoint.html) while\na long transaction is in progress can wreak havoc on database\nreplicas. Thus launched a race, which we recently completed, to\neliminate all `SAVEPOINT` queries from our code. 
Here's what happened,\nhow we discovered the problem, and what we did to fix it.\n\n### The symptoms begin\n\nOn June 24th, we noticed that our CI/CD runners service reported a high\nerror rate:\n\n![runners errors](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/ci-runners-errors.png)\n\nA quick investigation revealed that database queries used to retrieve\nCI/CD builds data were timing out and that the unprocessed builds\nbacklog grew at a high rate:\n\n![builds queue](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/builds-queue.png)\n\nOur monitoring also showed that some of the SQL queries were waiting for\nPostgreSQL lightweight locks (`LWLocks`):\n\n![aggregated lwlocks](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/aggregated-lwlocks.png)\n\nIn the following weeks, we experienced a few incidents like this. We were\nsurprised to see how sudden these performance degradations were, and how\nquickly things could go back to normal:\n\n![ci queries latency](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/ci-queries-latency.png)\n\n### Introducing Nessie: Stalled database queries\n\nIn order to learn more, we extended our observability tooling [to sample\nmore data from `pg_stat_activity`](https://gitlab.com/gitlab-cookbooks/gitlab-exporters/-/merge_requests/231). In PostgreSQL, the `pg_stat_activity`\nvirtual table contains the list of all database connections in the system as\nwell as what they are waiting for, such as a SQL query from the\nclient. We observed a consistent pattern: the queries were waiting on\n`SubtransControlLock`. Below is a graph of the URLs or jobs that were\nstalled:\n\n![endpoints locked](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/endpoints-locked.png)\n\nThe purple line shows the sampled number of transactions locked by\n`SubtransControlLock` for the `POST /api/v4/jobs/request` endpoint that\nwe use for internal communication between GitLab and GitLab Runners\nprocessing CI/CD jobs.\n\nAlthough this endpoint was impacted the most, the whole database cluster\nappeared to be affected as many other, unrelated queries timed out.\n\nThis same pattern would rear its head on random days. A week would pass\nby without incident, and then it would show up for 15 minutes and\ndisappear for days. Were we chasing the Loch Ness Monster?\n\nLet's call these stalled queries Nessie for fun and profit.\n\n### What is a `SAVEPOINT`?\n\nTo understand `SubtransControlLock` ([PostgreSQL\n13](https://www.postgresql.org/docs/13/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW)\nrenamed this to `SubtransSLRU`), we first must understand how\nsubtransactions work in PostgreSQL. In PostgreSQL, a transaction can\nstart via a `BEGIN` statement, and a subtransaction can be started with\na subsequent `SAVEPOINT` query. PostgreSQL assigns each of these a\ntransaction ID (XID for short) [when a transaction or a subtransaction\nneeds one, usually before a client modifies data](https://gitlab.com/postgres/postgres/blob/a00c138b78521b9bc68b480490a8d601ecdeb816/src/backend/access/transam/README#L193-L198).\n\n#### Why would you use a `SAVEPOINT`?\n\nFor example, let's say you were running an online store and a customer\nplaced an order. Before the order is fulfilled, the system needs to\nensure a credit card account exists for that user. 
In Rails, a common\npattern is to start a transaction for the order and call\n[`find_or_create_by`](https://apidock.com/rails/v5.2.3/ActiveRecord/Relation/find_or_create_by). For\nexample:\n\n```ruby\nOrder.transaction do\n  begin\n    CreditAccount.transaction(requires_new: true) do\n      CreditAccount.find_or_create_by(customer_id: customer.id)\n    end\n  rescue ActiveRecord::RecordNotUnique\n    retry\n  end\n\n  # Fulfill the order\n  # ...\nend\n```\n\nIf two orders were placed around the same time, you wouldn't want the\ncreation of a duplicate account to fail one of the orders. Instead, you\nwould want the system to say, \"Oh, an account was just created; let me\nuse that.\"\n\nThat's where subtransactions come in handy: the `requires_new: true`\ntells Rails to start a new subtransaction if the application is already\nin a transaction. The code above translates into several SQL calls that\nlook something like:\n\n```sql\n--- Start a transaction\nBEGIN\nSAVEPOINT active_record_1\n--- Look up the account\nSELECT * FROM credit_accounts WHERE customer_id = 1\n--- Insert the account; this may fail due to a duplicate constraint\nINSERT INTO credit_accounts (customer_id) VALUES (1)\n--- Abort this by rolling back\nROLLBACK TO active_record_1\n--- Retry here: Start a new subtransaction\nSAVEPOINT active_record_2\n--- Find the newly-created account\nSELECT * FROM credit_accounts WHERE customer_id = 1\n--- Save the data\nRELEASE SAVEPOINT active_record_2\nCOMMIT\n```\n\nOn line 7 above, the `INSERT` might fail if the customer account was\nalready created, and the database unique constraint would prevent a\nduplicate entry. Without the first `SAVEPOINT` and `ROLLBACK` block, the\nwhole transaction would have failed. With that subtransaction, the\ntransaction can retry gracefully and look up the existing account.\n\n### What is `SubtransControlLock`?\n\nAs we mentioned earlier, Nessie returned at random times with queries\nwaiting for `SubtransControlLock`. `SubtransControlLock` indicates that\nthe query is waiting for PostgreSQL to load subtransaction data from\ndisk into shared memory.\n\nWhy is this data needed? When a client runs a `SELECT`, for example,\nPostgreSQL needs to decide whether each version of a row, known as a\ntuple, is actually visible within the current transaction. It's possible\nthat a tuple has been deleted or has yet to be committed by another\ntransaction. Since only a top-level transaction can actually commit\ndata, PostgreSQL needs to map a subtransaction ID (subXID) to its parent\nXID.\n\nThis mapping of subXID to parent XID is stored on disk in the\n`pg_subtrans` directory. Since reading from disk is slow, PostgreSQL\nadds a simple least-recently used (SLRU) cache in front for each\nbackend process. The lookup is fast if the desired page is already\ncached. However, as [Laurenz Albe discussed in his blog\npost](https://www.cybertec-postgresql.com/en/subtransactions-and-performance-in-postgresql/),\nPostgreSQL may need to read from disk if the number of active\nsubtransactions exceeds 64 in a given transaction, a condition\nPostgreSQL terms `suboverflow`. Think of it as the feeling you might get\nif you ate too many Subway sandwiches.\n\nSuboverflowing (is that a word?) 
can bog down performance because as\nLaurenz said, \"Other transactions have to update `pg_subtrans` to\nregister subtransactions, and you can see in the perf output how they\nvie for lightweight locks with the readers.\"\n\n### Hunting for nested subtransactions\n\nLaurenz's blog post suggested that we might be using too many\nsubtransactions in one transaction. At first, we suspected we might be\ndoing this in some of our expensive background jobs, such as project\nexport or import. However, while we did see numerous `SAVEPOINT` calls\nin these jobs, we didn't see an unusual degree of nesting in local\ntesting.\n\nTo isolate the cause, we started by [adding a Prometheus metric to track\nsubtransactions by model](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/66477).\nThis led to nice graphs like the following:\n\n![subtransactions plot](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/subtransactions-plot.png)\n\nWhile this was helpful in seeing the rate of subtransactions over time,\nwe didn't see any obvious spikes that occurred around the time of the\ndatabase stalls. Still, it was possible that suboverflow was happening.\n\nTo see if that was happening, we [instrumented our application to track\nsubtransactions and log a message whenever we detected more than 32\n`SAVEPOINT` calls in a given transaction](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/67918). Rails\nmakes it possible for the application to subscribe to all of its SQL\nqueries via `ActiveSupport` notifications. Our instrumentation looked\nsomething like this, simplified for the purposes of discussion:\n\n```ruby\nActiveSupport::Notifications.subscribe('sql.active_record') do |event|\n  sql = event.payload.dig(:sql).to_s\n  connection = event.payload[:connection]\n  manager = connection&.transaction_manager\n\n  context = manager&.transaction_context\n  next if context.nil?\n\n  if sql.start_with?('BEGIN')\n    context.set_depth(0)\n  elsif sql.start_with?('SAVEPOINT', 'EXCEPTION')\n    context.increment_savepoints\n  elsif sql.start_with?('ROLLBACK TO SAVEPOINT')\n    context.increment_rollbacks\n  elsif sql.start_with?('RELEASE SAVEPOINT')\n    context.increment_releases\n  elsif sql.start_with?('COMMIT', 'ROLLBACK')\n    context.finish_transaction\n  end\nend\n```\n\nThis code looks for the key SQL commands that initiate transactions and\nsubtransactions and increments counters when they occur. After a\n`COMMIT`, we log a JSON message that contains the backtrace and the\nnumber of `SAVEPOINT` and `RELEASE SAVEPOINT` calls. 
For example:\n\n```json\n{\n  \"sql\": \"/*application:web,correlation_id:01FEBFH1YTMSFEEHS57FA8C6JX,endpoint_id:POST /api/:version/projects/:id/merge_requests/:merge_request_iid/approve*/ BEGIN\",\n  \"savepoints_count\": 1,\n  \"savepoint_backtraces\": [\n    [\n      \"app/models/application_record.rb:75:in `block in safe_find_or_create_by'\",\n      \"app/models/application_record.rb:75:in `safe_find_or_create_by'\",\n      \"app/models/merge_request.rb:1859:in `ensure_metrics'\",\n      \"ee/lib/analytics/merge_request_metrics_refresh.rb:11:in `block in execute'\",\n      \"ee/lib/analytics/merge_request_metrics_refresh.rb:10:in `each'\",\n      \"ee/lib/analytics/merge_request_metrics_refresh.rb:10:in `execute'\",\n      \"ee/app/services/ee/merge_requests/approval_service.rb:57:in `calculate_approvals_metrics'\",\n      \"ee/app/services/ee/merge_requests/approval_service.rb:45:in `block in create_event'\",\n      \"ee/app/services/ee/merge_requests/approval_service.rb:43:in `create_event'\",\n      \"app/services/merge_requests/approval_service.rb:13:in `execute'\",\n      \"ee/app/services/ee/merge_requests/approval_service.rb:14:in `execute'\",\n      \"lib/api/merge_request_approvals.rb:58:in `block (3 levels) in \u003Cclass:MergeRequestApprovals>'\"\n    ]\n  ],\n  \"rollbacks_count\": 0,\n  \"releases_count\": 1\n}\n```\n\nThis log message contains not only the number of subtransactions via\n`savepoints_count`, but it also contains a handy backtrace that\nidentifies the exact source of the problem. The `sql` field also\ncontains [Marginalia comments](https://github.com/basecamp/marginalia)\nthat we tack onto every SQL query. These comments make it possible to\nidentify what HTTP request initiated the SQL query.
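\n\nThe gem's setup is small; as a rough sketch (an illustrative Rails initializer using the gem's documented API, not GitLab's exact configuration):\n\n```ruby\n# config/initializers/marginalia.rb -- illustrative configuration only.\nrequire 'marginalia'\n\n# Append a comment to every SQL query naming the application and the\n# controller/action that issued it.\nMarginalia::Comment.components = [:application, :controller, :action]\n```\n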
\n### Taking a hard look at PostgreSQL\n\nThe new instrumentation showed that while the application regularly used\nsubtransactions, it never exceeded 10 nested `SAVEPOINT` calls.\n\nMeanwhile, [Nikolay Samokhvalov](https://gitlab.com/NikolayS), founder\nof [Postgres.ai](https://postgres.ai/), performed a battery of tests [trying to replicate the problem](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/20).\nHe replicated Laurenz's results when a single transaction exceeded 64\nsubtransactions, but that wasn't happening here.\n\nWhen the database stalls occurred, we observed a number of patterns:\n\n1. Only the replicas were affected; the primary remained unaffected.\n1. There was a long-running transaction, usually relating to\nPostgreSQL's autovacuuming, at the time. The stalls stopped quickly after the transaction ended.\n\nWhy would this matter? Analyzing the PostgreSQL source code, Senior\nSupport Engineer [Catalin Irimie](https://gitlab.com/cat) [posed an\nintriguing question that led to a breakthrough in our understanding](https://gitlab.com/gitlab-org/gitlab/-/issues/338410#note_652056284):\n\n> Does this mean that, having subtransactions spanning more than 32 cache pages, concurrently, would trigger the exclusive SubtransControlLock because we still end up reading them from the disk?\n\n### Reproducing the problem with replicas\n\nTo answer this, Nikolay immediately modified his test [to involve replicas and long-running transactions](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/21#note_653453774). Within a day, he reproduced the problem:\n\n![Nikolay experiment](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/nikolay-experiment.png)\n\nThe image above shows that transaction rates remain steady around\n360,000 transactions per second (TPS). Everything was proceeding fine\nuntil the long-running transaction started on the primary. Then suddenly\nthe transaction rates plummeted to 50,000 TPS on the replicas. Canceling\nthe long transaction immediately caused the transaction rate to return to normal.\n\n### What is going on here?\n\nIn his blog post, Nikolay called the problem [Subtrans SLRU overflow](https://v2.postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful#problem-4-subtrans-slru-overflow).\nIn a busy database, it's possible for the size of the subtransaction log\nto grow so large that the working set no longer fits into memory. This\nresults in a lot of cache misses, which in turn causes a high amount of\ndisk I/O and CPU as PostgreSQL furiously tries to load data from disk to\nkeep up with all the lookups.\n\nAs mentioned earlier, the subtransaction cache holds a mapping of the\nsubXID to the parent XID. When PostgreSQL needs to look up the subXID,\nit calculates in which memory page this ID would live, and then does a\nlinear search to find it in the memory page. If the page is not in the\ncache, it evicts one page and loads the desired one into memory. The\ndiagram below shows the memory layout of the subtransaction SLRU.\n\n![Subtrans SLRU](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/subtrans-slru.png)\n\nBy default, each SLRU page is an 8K buffer holding 4-byte parent\nXIDs. This means 8192/4 = 2048 transaction IDs can be stored in each\npage.\n\nNote that there may be gaps in each page. PostgreSQL will cache XIDs as\nneeded, so a single XID can occupy an entire page.\n\nThere are 32 (`NUM_SUBTRANS_BUFFERS`) pages, which means up to 65K\ntransaction IDs can be stored in memory. Nikolay demonstrated that in a\nbusy system, it took about 18 seconds to fill up all 65K entries. Then\nperformance dropped off a cliff, making the database replicas unusable.\n\nTo our surprise, our experiments also demonstrated that a single\n`SAVEPOINT` during a long transaction [could initiate this problem if\nmany writes also occurred simultaneously](https://gitlab.com/gitlab-org/gitlab/-/issues/338865#note_655312474). That\nis, it wasn't enough just to reduce the frequency of `SAVEPOINT`; we had\nto eliminate them completely.\n\n#### Why does a single `SAVEPOINT` cause problems?\n\nTo answer this question, we need to understand what happens when a\n`SAVEPOINT` occurs in one query while a long-running transaction is\nin progress.\n\nWe mentioned earlier that PostgreSQL needs to decide whether a given row\nis visible to support a feature called [multi-version concurrency control](https://www.postgresql.org/docs/current/mvcc.html), or MVCC for\nshort. It does this by storing hidden columns, `xmin` and `xmax`, in\neach tuple.\n\n`xmin` holds the XID of when the tuple was created, and `xmax` holds the\nXID when it was marked as dead (0 if the row is still present). In\naddition, at the beginning of a transaction, PostgreSQL records metadata\nin a database snapshot. Among other items, this snapshot records the\noldest XID and the newest XID in its own `xmin` and `xmax` values.\n\nThis metadata helps [PostgreSQL determine whether a tuple is visible](https://www.interdb.jp/pg/pgsql05.html).\nFor example, a committed XID that started before `xmin` is definitely\nvisible, while anything after `xmax` is invisible.
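\n\nAs a rough mental model of that rule (a deliberate simplification that ignores commit-status lookups, hint bits, and the rest of the real C implementation), the check looks something like:\n\n```ruby\n# Simplified mental model of MVCC snapshot visibility -- not PostgreSQL's\n# actual code. Assumes the tuple's creating transaction has committed.\nSnapshot = Struct.new(:xmin, :xmax, :in_progress_xids)\n\ndef tuple_visible?(tuple_xmin, snapshot)\n  return false if tuple_xmin >= snapshot.xmax  # started after the snapshot\n  return true  if tuple_xmin < snapshot.xmin   # committed before the snapshot\n  # In between: visible only if the writing transaction was not\n  # still in progress when the snapshot was taken.\n  !snapshot.in_progress_xids.include?(tuple_xmin)\nend\n```\n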
\n### What does this have to do with long transactions?\n\nLong transactions are bad in general because they can tie up\nconnections, but they can cause a subtly different problem on a\nreplica. On the replica, a single `SAVEPOINT` during a long transaction\ncauses a snapshot to suboverflow. Remember that suboverflow dragged down performance\nin the case where we had more than 64 subtransactions.\n\nFundamentally, the problem happens because a replica behaves differently\nfrom a primary when creating snapshots and checking for tuple\nvisibility. The diagram below illustrates an example with some of the\ndata structures used in PostgreSQL:\n\n![Diagram of subtransaction handling in replicas](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/pg-replica-subtransaction-diagram.png)\n\nOn the top of this diagram, we can see the XIDs increase at the\nbeginning of a subtransaction: the `INSERT` after the `BEGIN` gets 1,\nand the subsequent `INSERT` in `SAVEPOINT` gets 2. Another client comes\nalong and performs an `INSERT` and `SELECT` at XID 3.\n\nOn the primary, PostgreSQL stores the transactions in progress in a\nshared memory segment. The process array (`procarray`) stores XID 1 with\nthe first connection, and the database also writes that information to\nthe `pg_xact` directory. XID 2 gets stored in the `pg_subtrans`\ndirectory, mapped to its parent, XID 1.\n\nIf a read happens on the primary, the snapshot generated contains `xmin`\nas 1, and `xmax` as 3. `txip` holds a list of transactions in progress,\nand `subxip` holds a list of subtransactions in progress.\n\nHowever, neither the `procarray` nor the snapshot is shared directly\nwith the replica. The replica receives all the data it needs from the\nwrite-ahead log (WAL).\n\nPlaying the WAL back one entry at a time, the replica populates a shared data\nstructure called `KnownAssignedXids`. It contains all the transactions in\nprogress on the primary. Since this structure can only hold a limited number of\nIDs, a busy database with a lot of active subtransactions could easily fill\nthis buffer. PostgreSQL made a design choice to kick out all subXIDs from this\nlist and store them in the `pg_subtrans` directory.\n\nWhen a snapshot is generated on the replica, notice how `txip` is\nblank. A PostgreSQL replica treats **all** XIDs as though they are\nsubtransactions and throws them into the `subxip` bucket. That works\nbecause if an XID has a parent XID, then it's a subtransaction. Otherwise, it's a normal transaction. [The code comments\nexplain the rationale](https://gitlab.com/postgres/postgres/blob/9f540f840665936132dd30bd8e58e9a67e648f22/src/backend/storage/ipc/procarray.c#L1665-L1681).\n\nHowever, this means the snapshot is missing subXIDs, and that could be\nbad for MVCC. To deal with that, the [replica also updates `lastOverflowedXid`](https://gitlab.com/postgres/postgres/blob/9f540f840665936132dd30bd8e58e9a67e648f22/src/backend/storage/ipc/procarray.c#L3176-L3182):\n\n```c\n * When we throw away subXIDs from KnownAssignedXids, we need to keep track of\n * that, similarly to tracking overflow of a PGPROC's subxids array.  
We do\n * that by remembering the lastOverflowedXID, ie the last thrown-away subXID.\n * As long as that is within the range of interesting XIDs, we have to assume\n * that subXIDs are missing from snapshots.  (Note that subXID overflow occurs\n * on primary when 65th subXID arrives, whereas on standby it occurs when 64th\n * subXID arrives - that is not an error.)\n```\n\nWhat is this \"range of interesting XIDs\"? We can see this in [the code below](https://gitlab.com/postgres/postgres/blob/4bf0bce161097869be5a56706b31388ba15e0113/src/backend/storage/ipc/procarray.c#L1702-L1703):\n\n```c\nif (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))\n    suboverflowed = true;\n```\n\nIf `lastOverflowedXid` is smaller than our snapshot's `xmin`, it means\nthat all subtransactions have completed, so we don't need to check for\nsubtransactions. However, in our example:\n\n1. `xmin` is 1 because of the transaction.\n2. `lastOverflowedXid` is 2 because of the `SAVEPOINT`.\n\nThis means `suboverflowed` is set to `true` here, which tells PostgreSQL\nthat whenever an XID needs to be checked, check to see if it has a parent\nXID. Remember that this causes PostgreSQL to:\n\n1. Look up the parent XID for the subXID in the SLRU cache.\n1. If this doesn't exist in the cache, fetch the data from `pg_subtrans`.\n\nIn a busy system, the requested XIDs could span an ever-growing range of\nvalues, which could easily exhaust the 65K entries in the SLRU\ncache. This range will continue to grow as long as the transaction runs;\nthe rate of increase depends on how many updates are happening on the\nprimary. As soon as the transaction terminates, the `suboverflowed` state\ngets set to `false`.\n\nIn other words, we've replicated the same conditions as we saw with 64\nsubtransactions, only with a single `SAVEPOINT` and a long transaction.\n\n### What can we do about getting rid of Nessie?\n\nThere are three options:\n\n1. Eliminate `SAVEPOINT` calls completely.\n1. Eliminate all long-running transactions.\n1. Apply [Andrey Borodin's patches to PostgreSQL and increase the subtransaction cache](https://www.postgresql.org/message-id/flat/494C5E7F-E410-48FA-A93E-F7723D859561%40yandex-team.ru#18c79477bf7fc44a3ac3d1ce55e4c169).\n\nWe chose the first option because most uses of subtransactions could be\nremoved fairly easily. There were a [number of approaches](https://gitlab.com/groups/gitlab-org/-/epics/6540) we took:\n\n1. Perform updates outside of a subtransaction. Examples: [1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68471), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68690)\n1. Rewrite a query to use an `INSERT` or an `UPDATE` with an `ON CONFLICT` clause to deal with duplicate constraint violations (see the sketch after this list). Examples: [1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68433), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/69240), [3](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68509)\n1. Live with a non-atomic `find_or_create_by`. We used this approach sparingly. Example: [1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/68649)
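\n\nAs a sketch of the second approach (reusing the hypothetical `CreditAccount` model from earlier, and assuming a unique index on `customer_id`), Rails 6's `insert` generates a single `INSERT ... ON CONFLICT DO NOTHING` statement, so no `SAVEPOINT` is ever issued:\n\n```ruby\n# Sketch only: assumes a unique index on credit_accounts.customer_id.\n# The ON CONFLICT clause absorbs the duplicate-key case in one statement,\n# so no SAVEPOINT (and therefore no subtransaction) is needed.\nCreditAccount.insert({ customer_id: customer.id }, unique_by: :customer_id)\naccount = CreditAccount.find_by!(customer_id: customer.id)\n```\n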
\nIn addition, we added [an alert whenever the application used a single `SAVEPOINT`](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3881):\n\n![subtransaction alert](https://about.gitlab.com/images/blogimages/postgresql-subtransactions/subtransactions-alert-example.png)\n\nThis had the side benefit of flagging a [minor bug](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70889).\n\n#### Why not eliminate all long-running transactions?\n\nIn our database, it wasn't practical to eliminate all long-running\ntransactions because we think many of them happened via [database\nautovacuuming](https://www.postgresql.org/docs/current/runtime-config-autovacuum.html),\nbut [we're not able to reproduce this yet](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/21#note_669698320).\nWe are working on partitioning the tables and sharding the database, but this is a much more time-consuming problem\nthan removing all subtransactions.\n\n#### What about the PostgreSQL patches?\n\nAlthough we tested Andrey's PostgreSQL patches, we did not feel comfortable\ndeviating from the official PostgreSQL releases. Plus, maintaining a\ncustom patched release over upgrades would add a significant maintenance\nburden for our infrastructure team. Our self-managed customers would\nalso not benefit unless they used a patched database.\n\nAndrey's patches do two main things:\n\n1. Allow administrators to change the SLRU size to any value.\n1. Add an [associative cache](https://www.youtube.com/watch?v=A0vR-ks3hsQ) to make it performant to use a large cache value.\n\nRemember that the SLRU cache does a linear search for the desired\npage. That works fine when there are only 32 pages to search, but if you\nincrease the cache size to 100 MB the search becomes much more\nexpensive. The associative cache makes the lookup fast by indexing pages\nwith a bitmask and looking up the entry with offsets from the remaining\nbits. This mitigates the problem because a transaction would need to be\nseveral orders of magnitude longer to cause issues.\n\nNikolay demonstrated that the `SAVEPOINT` problem disappeared as soon as\nwe increased the SLRU size to 100 MB with those patches. With a 100 MB\ncache, PostgreSQL can cache 26.2 million IDs (104857600/4), far more\nthan the measly 65K.\n\nThese [patches are currently awaiting review](https://postgres.ai/blog/20210831-postgresql-subtransactions-considered-harmful#ideas-for-postgresql-development),\nbut in our opinion they should be given high priority for PostgreSQL 15.\n\n### Conclusion\n\nSince removing all `SAVEPOINT` queries, we have not seen Nessie rear her\nhead again. 
If you are running PostgreSQL with read replicas, we\nstrongly recommend that you also remove *all* subtransactions until\nfurther notice.\n\nPostgreSQL is a fantastic database, and its well-commented code makes it\npossible to understand its limitations under different configurations.\n\nWe would like to thank the GitLab community for bearing with us while we\niron out this production issue.\n\nWe are also grateful for the support from [Nikolay\nSamokhvalov](https://gitlab.com/NikolayS) and [Catalin\nIrimie](https://gitlab.com/cat), who contributed to understanding where our\nLoch Ness Monster was hiding.\n\nCover image by [Khadi Ganiev](https://www.istockphoto.com/portfolio/Ganiev?mediatype=photography) on [iStock](https://istock.com), licensed under [standard license](https://www.istockphoto.com/legal/license-agreement)\n",[9,2064,1885],"contributors",{"slug":2066,"featured":6,"template":688},"why-we-spent-the-last-month-eliminating-postgresql-subtransactions","content:en-us:blog:why-we-spent-the-last-month-eliminating-postgresql-subtransactions.yml","Why We Spent The Last Month Eliminating Postgresql Subtransactions","en-us/blog/why-we-spent-the-last-month-eliminating-postgresql-subtransactions.yml","en-us/blog/why-we-spent-the-last-month-eliminating-postgresql-subtransactions",{"_path":2072,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":2073,"content":2079,"config":2085,"_id":2087,"_type":13,"title":2088,"_source":15,"_file":2089,"_stem":2090,"_extension":18},"/en-us/blog/why-we-use-rails-to-build-gitlab",{"title":2074,"description":2075,"ogTitle":2074,"ogDescription":2075,"noIndex":6,"ogImage":2076,"ogUrl":2077,"ogSiteName":672,"ogType":673,"canonicalUrls":2077,"schema":2078},"Why we use Ruby on Rails to build GitLab","Here's our CEO on GitLab’s inception using Rails, and how challenges are being handled along the way.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749668296/Blog/Hero%20Images/gitlab-ruby.jpg","https://about.gitlab.com/blog/why-we-use-rails-to-build-gitlab","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"Why we use Ruby on Rails to build GitLab\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Aricka Flowers\"}],\n        \"datePublished\": \"2018-10-29\",\n      }",{"title":2074,"description":2075,"authors":2080,"heroImage":2076,"date":2082,"body":2083,"category":298,"tags":2084},[2081],"Aricka Flowers","2018-10-29","\nWhen our Co-founder and Engineering Fellow [Dmitriy Zaporozhets](/company/team/#dzaporozhets) decided to build GitLab, he chose to do it with Ruby on Rails, despite working primarily in PHP at the time. GitHub, a source of inspiration for GitLab, was also based on Rails, making it a logical pick considering his interest in the framework. GitLab CEO [Sid Sijbrandij](/company/team/#sytses) thinks his co-founder made a good choice:\n\n\"It's worked out really well because the Ruby on Rails ecosystem allows you to shape a lot of functionality at a high quality,\" he explained. \"If you look at GitLab, it has an enormous amount of functionality. Software development is very complex and to help with that, we need a lot of functionality and Ruby on Rails is a way to do it. Because there's all these best practices that are on your happy path, it’s also a way to keep the code consistent when you ship something like GitLab. 
You're kind of guided into doing the right thing.\"\n\n### Depending on useful gems\n\nRuby gems play an integral role in the building of GitLab, with it loading more than a thousand non-unique gems, according to Sid. Calling the Ruby on Rails framework \"very opinionated,\" he thinks it's a strong environment in which to build a complex app like GitLab.\n\n\"There's a great ecosystem around it with gems that can make assumptions about how you're doing things and in that regard, I think the Ruby on Rails ecosystem is still without par,\" he says. \"If you look at our Gemfile, it gives you an indication of how big the tower is of dependencies that we can build on. Ruby on Rails has amazing shoulders to stand on and it would have been much slower to develop GitLab in any other framework.\"\n\n### Overcoming challenges\n\nAll of this is not to say there haven’t been challenges in building GitLab with Ruby on Rails. Performance has been an issue that our developers have made strides to improve in a number of ways, including rewriting code in Go and [using the Vue framework](/blog/why-we-chose-vue/). The latter is being used to rewrite frequently accessed pages, like issues and merge requests, so they load faster, improving user experience.\n\nGo is being used to address other issues affecting load times and reduce memory usage.\n\n\"Ruby was optimized for the developer, not for running it in production,\" says Sid. \"For the things that get hit a lot and have to be very performant or that, for example, have to wait very long on a system IO, we rewrite those in Go … We are still trying to make GitLab use less memory. So, we'll need to enable multithreading. When we developed GitLab that was not common in the Ruby on Rails ecosystem. Now it's more common, but because we now have so much code and so many dependencies, it's going to be a longer path for us to get there. That should help; it won't make it blazingly fast, but at least it will use less memory.\"\n\nAdding Go to GitLab’s toolbox led to the creation of a separate service called [Gitaly](/blog/the-road-to-gitaly-1-0/), which handles all Git requests.\n\n### Building on GitLab’s mission\n\nThe organized, structured style of Ruby on Rails’ framework falls in line with our core mission. Because Rails is streamlined, anyone can jump into GitLab and participate, which made it especially attractive to Sid from the start.\n\n\"[Our mission is that everyone can contribute](/company/mission/#mission),\" he explains. \"Because Ruby on Rails is really opinionated about which pieces go where, it's much easier for new developers to get into the codebase, because you know where people have put stuff. For example, in every kitchen you enter, you never know where the knives and plates are located. But with Ruby on Rails, you enter the kitchen and it's always in the same place, and we want to stick to that.\n\n>In every kitchen you enter, you never know where the knives and plates are located. But with Ruby on Rails, you enter the kitchen and it's always in the same place, and we want to stick to that.\n\n\"I was really encouraged when I opened the project and saw it for the first time a year after Dmitriy started it. I opened it up and it's idiomatic Rails. He followed all the principles. He didn't try to experiment with some kind of fad that he was interested in. He made it into a production application. Dmitriy carefully vetted all the contributions to make sure they stick to those conventions, and that's still the case. 
I think we have a very nice codebase that allows other people to build on top of it. One of our sub-values is [boring solutions](https://handbook.gitlab.com/handbook/values/#efficiency): don't do anything fancy. This is so that others can build on top of it. I think we've done that really well … and we're really thankful that Ruby has been such a stable ecosystem for us to build on.\"\n\n[Cover image](https://unsplash.com/photos/0y6Y56Pw6DA) by [Elvir K](https://unsplash.com/@elvir) on Unsplash\n{: .note}\n",[987,266,1885,754,9,864,732],{"slug":2086,"featured":6,"template":688},"why-we-use-rails-to-build-gitlab","content:en-us:blog:why-we-use-rails-to-build-gitlab.yml","Why We Use Rails To Build Gitlab","en-us/blog/why-we-use-rails-to-build-gitlab.yml","en-us/blog/why-we-use-rails-to-build-gitlab",{"_path":2092,"_dir":243,"_draft":6,"_partial":6,"_locale":7,"seo":2093,"content":2099,"config":2104,"_id":2106,"_type":13,"title":2107,"_source":15,"_file":2108,"_stem":2109,"_extension":18},"/en-us/blog/a-beginners-guide-to-the-git-reftable-format",{"title":2094,"description":2095,"ogTitle":2094,"ogDescription":2095,"noIndex":6,"ogImage":2096,"ogUrl":2097,"ogSiteName":672,"ogType":673,"canonicalUrls":2097,"schema":2098},"A beginner's guide to the Git reftable format","In Git 2.45.0, GitLab upstreamed the reftable backend to Git, which completely changes how references are stored. Get an in-depth look at the inner workings of this new format.","https://res.cloudinary.com/about-gitlab-com/image/upload/v1749664595/Blog/Hero%20Images/blog-image-template-1800x945__9_.png","https://about.gitlab.com/blog/a-beginners-guide-to-the-git-reftable-format","\n                        {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Article\",\n        \"headline\": \"A beginner's guide to the Git reftable format\",\n        \"author\": [{\"@type\":\"Person\",\"name\":\"Patrick Steinhardt\"}],\n        \"datePublished\": \"2024-05-30\",\n      }",{"title":2094,"description":2095,"authors":2100,"heroImage":2096,"date":2101,"body":2102,"category":1507,"tags":2103},[1587],"2024-05-30","Until recently, the \"files\" format was the only way for Git to store references. With the [release of Git 2.45.0](https://about.gitlab.com/blog/whats-new-in-git-2-45-0/), Git can now store references in a \"reftable\" format. This new format is a binary format that is quite a bit more complex, but that complexity allows it to address several shortcomings of the \"files\" format. The design goals for the \"reftable\" format include:\n\n- Make the lookup of a single reference and iteration through ranges of references as efficient and fast as possible.\n- Support for consistent reads of references so that Git never reads an in-between state when an update to multiple references has been applied only partially.\n- Support for atomic writes such that updating multiple references can be implemented as an all-or-nothing operation.\n- Efficient storage of both refs and the reflog.\n\nIn this article, we will go under the hood of the \"reftable\" format to see exactly how it works.\n\n## How Git stores references\n\nBefore we dive into the details of the \"reftable\" format, let's quickly recap how Git has historically stored references. If you are already familiar with this, you can skip this section.\n\nA Git repository keeps track of two important data structures:\n\n- [Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), which contain the actual data of your repository. 
This includes commits, the directory tree structure, and the blobs that contain your source code. Objects point to each other, forming an object graph. Furthermore, each object has an object ID that uniquely identifies the object.\n\n- References, such as branches and tags, which are pointers into the object graph so that you can give objects names that are easier to remember and keep track of different tracks of your development history. For example, a repository may contain a `main` branch, which is a reference named `refs/heads/main` that points to a specific commit.\n\nReferences are stored in the reference database. Until Git 2.45.0, there was only the \"files\" database format. In this format, every reference is stored as a normal file that contains either one of the following:\n\n- A regular reference that contains the object ID of the commit it points to.\n- A symbolic reference that contains the name of another reference, similar to how a symbolic link points to another file.\n\nAt regular intervals, these references get packed into a single `packed-refs` file to make lookups more efficient.\n\nThe following examples should give an idea of how the \"files\" format operates:\n\n```shell\n$ git init .\n$ git commit --allow-empty --message \"Initial commit\"\n[main (root-commit) 6917c17] Initial commit\n\n# HEAD is a symbolic reference pointing to refs/heads/main.\n$ cat .git/HEAD\nref: refs/heads/main\n\n# refs/heads/main is a regular reference pointing to a commit.\n$ cat .git/refs/heads/main\n6917c178cfc3c50215a82cf959204e9934af24c8\n\n# git-pack-refs(1) packs these references into the packed-refs file.\n$ git pack-refs --all\n$ cat .git/packed-refs\n# pack-refs with: peeled fully-peeled sorted\n6917c178cfc3c50215a82cf959204e9934af24c8 refs/heads/main\n```\n\n## High-level structure of reftables\n\nAssuming that you've got Git 2.45.0 or newer installed, you can create a repository with the \"reftable\" format by using the `--ref-format=reftable` switch:\n\n```shell\n$ git init --ref-format=reftable .\nInitialized empty Git repository in /tmp/repo/.git/\n$ git rev-parse --show-ref-format\nreftable\n\n# Irrelevant files have been removed for ease of understanding.\n$ tree .git\n.git\n├── config\n├── HEAD\n├── index\n├── objects\n├── refs\n│   └── heads\n└── reftable\n\t├── 0x000000000001-0x000000000002-40a482a9.ref\n\t└── tables.list\n\n4 directories, 6 files\n```\n\nFirst, looking at the repository configuration, you will see it has an `extension.refstorage` key:\n\n```shell\n$ cat .git/config\n[core]\n    repositoryformatversion = 1\n    filemode = true\n    bare = false\n    logallrefupdates = true\n[extensions]\n    refstorage = reftable\n```\n\nThis configuration indicates to Git that the repository has been initialized with the \"reftable\" format and tells Git to use the \"reftable\" backend to access it.\n\nWeirdly enough, the repository still has a few files that look as if the \"files\" backend was in use:\n\n- `HEAD` would usually be a symbolic reference pointing to your currently checked-out branch. While it is not used by the \"reftable\" backend, it is required for Git clients to detect the directory as a Git repository. Therefore, when using the \"reftable\" format, `HEAD` is a stub with contents `ref: refs/heads/.invalid`.\n\n- `refs/heads` is a file with contents `this repository uses the reftable format`. Git clients that do not know about the \"reftable\" format would usually expect this path to be a directory. 
Consequently, creating this path as a file intentionally causes such older Git clients to fail if they try to access the repository with the \"files\" backend.\n\nThe actual references are stored in the `reftable/` directory:\n\n```shell\n$ tree .git/reftable\n.git/reftable/\n├── 0x000000000001-0x000000000001-794bd722.ref\n└── tables.list\n\n$ cat .git/reftable/tables.list\n0x000000000001-0x000000000001-794bd722.ref\n```\n\nThere are two files here:\n\n- `0x000000000001-0x000000000001-794bd722.ref` is a table containing references and the reflog data in a binary format.\n\n- `tables.list` is, well, a list of tables. In the current state of the repository, the file contains a single line, which is the name of the table. This file tracks the current set of active tables in the \"reftable\" database and is updated whenever new tables get added to the repository.\n\nUpdating a reference creates a new table:\n\n```shell\n$ git commit --allow-empty --message \"Initial commit\"\n[main (root-commit) 1472a58] Initial commit\n\n$ tree .git/reftable\n.git/reftable/\n├── 0x000000000001-0x000000000002-eb87d12b.ref\n└── tables.list\n\n$ cat .git/reftable/tables.list\n0x000000000001-0x000000000002-eb87d12b.ref\n```\n\nAs you can see, the previous table has been replaced with a new one. Furthermore, the `tables.list` file has been updated to contain the new table.\n\n## The structure of a table\n\nAs mentioned earlier, the actual data of the reference database is contained in tables. Roughly speaking, a table is split up into multiple sections:\n\n- The \"header\" contains metadata about the table. Along with some other information, this includes the version of the format, the block size, and the hash function used by the repository (for example, SHA1 or SHA256).\n- The \"ref\" section contains your references. These records are keyed by reference name and point to either an object ID for regular references, or to another reference for symbolic references.\n- The \"obj\" section contains a reverse mapping from object IDs to the references that point to those object IDs. These records allow Git to efficiently look up which references point to a given object ID.\n- The \"log\" section contains your reflog entries. These records have a key that equals the reference name plus an index that represents the number of the log entry. Furthermore, they contain the old and new object IDs as well as the message for that reflog entry.\n- The \"footer\" contains offsets to the various sections.\n\n![long table with all the reftable sections](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749675179/Blog/Content%20Images/Frame_1_-_Reftable_overview.svg)\n\nEach of the section types is structured in a similar manner. Sections contain a set of records that are sorted by each record's key. For example, two ref records `refs/heads/aaaaa` and `refs/heads/bbb` use these reference names as their respective keys, and `refs/heads/aaaaa` would come before `refs/heads/bbb`.\n\nFurthermore, each section is divided into blocks of a fixed length. This block length is encoded in the header and serves two purposes:\n\n- Given the start of the section as well as the block size, the reader implicitly knows where each of the blocks starts. 
### Indices

While the search for records inside a block is now reasonably efficient, it's still inefficient to locate the block itself. A binary search may perform reasonably well when you have a couple of blocks, but repositories with millions of references may have hundreds or even thousands of blocks. Without any additional data structure, this would cause logarithmically many disk seeks on average.

To avoid this, every section may be followed by an index section that provides an efficient way to look up a block. Each index record holds the following information:

- The location of the block that it is indexing.
- The key of the last record of the block that it is indexing.

With three or fewer blocks, a binary search will always require, at most, two disk reads to find the desired block. This is the same number of reads we would have to do with an index: one to read the index itself and one to read the desired block. Consequently, indices are only written when they would actually save some reads, which is the case with four or more indexed blocks.

Now the question is: What happens when the index itself becomes so large that it spans multiple blocks? You might have guessed it: We write another index that indexes the index. These multi-level indices really only become necessary once you have repositories with hundreds of thousands of references.

Equipped with these indices, we can make the procedure to look up records even more efficient:

1. Determine whether there is an index by looking at the footer of the table.
    - If there is one, perform a binary search over the index to find the desired block. The found block may itself be an index block, in which case we repeat this step until we hit a block with records of the desired type.
    - Otherwise, perform a binary search over the blocks as we did before.
2. Perform a binary search over the restart points, identifying the sub-section of the block that must contain our record.
3. Perform a linear search over the records in that sub-section.
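The index lookup in step 1 boils down to a search over last-record keys. A small Python sketch, with a hypothetical in-memory index rather than the on-disk format, shows why the last key of each block is all we need:

```python
import bisect

# Each index record stores the last key of the block it points to. The first
# index entry whose key is >= the target therefore identifies the only block
# that can possibly contain the target.

blocks = [
    ["refs/heads/a", "refs/heads/b"],        # block 0
    ["refs/heads/main", "refs/tags/v1.0"],   # block 1
    ["refs/tags/v2.0", "refs/tags/v3.0"],    # block 2
]
index = [(block[-1], position) for position, block in enumerate(blocks)]

def find_block(index, target):
    last_keys = [key for key, _ in index]
    pos = bisect.bisect_left(last_keys, target)
    return index[pos][1] if pos < len(index) else None

print(find_block(index, "refs/heads/main"))  # 1
print(find_block(index, "refs/tags/v9.9"))   # None: larger than every key
```

A multi-level index simply repeats `find_block` on index blocks until it lands in a block of the desired record type.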
## Multiple tables

Up to this point, we have only discussed how to read a _single_ table. But as the name `tables.list` indicates, you can actually have a list of tables in your "reftable" database.

Every time you update a reference in your repository, a new table is written and appended to `tables.list`. Thus, you will eventually end up with multiple tables:

```shell
$ tree .git/reftable/
.git/reftable/
├── 0x000000000001-0x000000000007-8dcd8a77.ref
├── 0x000000000008-0x000000000008-30e0f6f6.ref
└── tables.list

$ cat .git/reftable/tables.list
0x000000000001-0x000000000007-8dcd8a77.ref
0x000000000008-0x000000000008-30e0f6f6.ref
```

Reading the actual state of the repository requires us to merge these multiple tables into a single virtual table.

You might be wondering: If a table is written for each reference update and the same reference is updated multiple times, how does the "reftable" format know the most up-to-date value of a given reference? Intuitively, one could assume the value would be the one from the newest table containing the reference.

In fact, every single record has a so-called update index that encodes the "priority" of a record. For example, if two ref records with the same name exist, then the one with the higher update index overrides the one with the lower update index.

These update indices are visible in the file structure above. The long hex strings (for example, `0x000000000001`) are the update indices, where the left-hand side of the table name is the minimum update index contained in the table and the right-hand side is the maximum update index.

Merging the tables then happens via a [priority queue](https://en.wikipedia.org/wiki/Priority_queue) that is ordered by the key of the ref record as well as its update index. Assuming we want to scan through all ref records, we would:

1. For every table, add its first record to the priority queue.

![Adding first record to the priority queue](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749675179/Blog/Content%20Images/Frame_5_-_Priority_queue_1.svg)

2. Yield the head of the priority queue. Because records with the same key are ordered by update index, the head must be the most up-to-date version of its reference. Add the next item from that table to the priority queue.

![Yielding the head of the priority queue](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749675179/Blog/Content%20Images/Frame_6_-_Priority_queue_2.svg)

3. Drop all records from the queue that have the same name. These records are shadowed, which means that they will not be shown. For each table for which we are dropping records, add the next record to the priority queue.

![Dropping all records from queue that have the same name](https://res.cloudinary.com/about-gitlab-com/image/upload/v1749675179/Blog/Content%20Images/Frame_7_-_Priority_queue_3.svg)

Now we can rinse and repeat to read records for other keys.

Tables may also contain special "tombstone" records that mark a record as having been deleted. This allows us to delete a record without having to rewrite all of the older tables that still contain it.
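The whole merge fits into a short generator. In this Python sketch, each table is a sorted list of records and, as a simplification, a table's position in the list stands in for its update index (later tables are newer); real reftable records carry their update index individually. `heapq` plays the role of the priority queue:

```python
import heapq

TOMBSTONE = None  # marks a reference as deleted

def merge_tables(tables):
    """Yield the visible (refname, value) pairs across all tables."""
    heap = []

    def push(table_idx, record_idx):
        if record_idx < len(tables[table_idx]):
            refname, value = tables[table_idx][record_idx]
            # Order by refname first, then by *descending* update index, so
            # the newest version of a reference sits at the queue's head.
            heapq.heappush(heap, (refname, -table_idx, value, table_idx, record_idx))

    # Step 1: seed the queue with the first record of every table.
    for table_idx in range(len(tables)):
        push(table_idx, 0)

    while heap:
        # Step 2: the head is the most up-to-date version of its reference.
        refname, _, value, table_idx, record_idx = heapq.heappop(heap)
        push(table_idx, record_idx + 1)
        if value is not TOMBSTONE:  # tombstones hide deleted references
            yield refname, value
        # Step 3: drop shadowed records with the same name from the queue.
        while heap and heap[0][0] == refname:
            _, _, _, t, r = heapq.heappop(heap)
            push(t, r + 1)

old = [("refs/heads/feature", "abc123"), ("refs/heads/main", "6917c17")]
new = [("refs/heads/feature", TOMBSTONE), ("refs/heads/main", "1472a58")]
print(list(merge_tables([old, new])))  # [('refs/heads/main', '1472a58')]
```

In the example, the newer table both deletes `refs/heads/feature` via a tombstone and shadows the old value of `refs/heads/main`, so only the updated `main` survives the merge.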
### Auto-compaction

While the idea behind the priority queue is simple enough, it would be rather inefficient to merge together dozens, let alone hundreds, of tables in this way. So while it is true that every reference update appends a new table to your `tables.list` file, that is only part of the story.

The other part is auto-compaction: After a new table has been appended to the list of tables, the "reftable" backend checks whether some of the tables should be merged. This is done by using a simple heuristic: We check whether the file sizes of the tables form a [geometric sequence](https://en.wikipedia.org/wiki/Geometric_progression), where every table `n` must be at least twice as large as the next-most-recent table `n + 1`. If that geometric sequence is violated, the backend will compact tables so that the geometric sequence is restored.

Over time, this will lead to structures that look like the following:

```shell
$ du --apparent-size .git/reftable/*
429    .git/reftable/0x000000000001-0x00000000bd7c-d9819000.ref
101    .git/reftable/0x00000000bd7d-0x00000000c5ac-c34b88a4.ref
32     .git/reftable/0x00000000c5ad-0x00000000cc6c-60391f53.ref
8      .git/reftable/0x00000000cc6d-0x00000000cdc1-61c30db1.ref
3      .git/reftable/0x00000000cdc2-0x00000000ce67-d9b55a96.ref
1      .git/reftable/0x00000000ce68-0x00000000ce6b-44721696.ref
1      .git/reftable/tables.list
```

Note how for every single table, the property `size(n) >= size(n+1) * 2` holds.
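As a rough illustration, here is how such a check might look in Python. This greedy sketch is a simplification of Git's actual compaction strategy: it only ever merges at the newest end of the list, and it approximates the size of a compacted table by the sum of its inputs:

```python
def auto_compact(sizes):
    """Merge the newest tables until the sizes form a geometric sequence again.

    `sizes` holds the table sizes ordered from oldest to newest, mirroring
    the order of tables.list. Compacting replaces the merged tables with a
    single new table.
    """
    sizes = list(sizes)
    # While the second-newest table is not at least twice as large as the
    # newest one, merge the two into a single table and re-check.
    while len(sizes) > 1 and sizes[-2] < 2 * sizes[-1]:
        sizes[-2:] = [sizes[-2] + sizes[-1]]
    return sizes

# Appending a size-1 table to the listing above violates the sequence at the
# tail and triggers a cascade of merges until the property is restored.
print(auto_compact([429, 101, 32, 8, 3, 1, 1]))  # [429, 101, 32, 13]
```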
One of the consequences of auto-compaction is that the "reftable" backend maintains itself, so we no longer have to run `git pack-refs` in a repository.

## Want to learn more?

You should now have a good understanding of how the new "reftable" format works under the hood. If you want to dive even deeper into the format, you can refer to the [technical documentation](https://git-scm.com/docs/reftable) provided by the Git project.

> Read our [Git 2.45.0 recap](https://about.gitlab.com/blog/whats-new-in-git-2-45-0/) to find out what else is in this version of Git.