VDC sizing and design

Intro

One of the best practical (rather than academic) definitions of an “information system” is: an automated system that produces data for further use. An algorithm, the engine of any information system, is a rule for transforming input data into output data. So, fundamentally, any information system transforms input data into output data. We can even say that is the sole reason for an information system to exist; therefore the value of an information system is defined by the value of its data. Any information system design therefore starts with the data, and then adds the algorithms, hardware and everything else required to deliver data with a known structure and value.

Prerequisites

Stored data

We start the design of an information system with data. First of all, we document all datasets planned for processing and storage. Data characteristics include:

  • the amount of data
  • data lifecycle (amount of new data per period, data lifetime, rules for processing outdated (dead) data)
  • data classification relative to the core business from the availability / integrity / confidentiality perspective, including financial KPIs (such as the financial impact of losing the last hour of data)
  • data processing geography (the physical location of the data processing hardware)
  • external requirements for each data class (personal data laws, PCI DSS, HIPAA, SOX, medical data laws, etc.).
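
For illustration only (the names and numbers below are made up), a single entry in such an inventory might look like this:

  Dataset: customer orders (OLTP database)
  Amount and lifecycle: 2 TB today, ~50 GB of new data per month, records older than 5 years archived and purged
  Classification: availability - critical (est. $10,000 per hour of downtime), integrity - critical, confidentiality - internal
  Geography: processed and stored in EU datacenters only
  External requirements: local personal data law, PCI DSS for payment card fields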

Information systems

Data is not only stored, but also processed (transformed) by information systems. So the next step is to create a full inventory of all information systems: their architectural traits, interoperability and relationships, and hardware requirements expressed in abstract resources.

DIY Storage benchmark for HCI and SDS

– Pilot, give me numbers!
– 36!
– What 36?
– What numbers?

This is what synthetic storage benchmarking typically looks like today. But why? It was fine just a couple of years ago.

Until about 10 years ago most storage systems were flat arrays with uniform access. That means an array was built from a large number of identical (performance-wise) disks, for example 300x 15k RPM disks. Uniform access means that the access time to any data block is the same (not counting cache). It is essentially the same principle as UMA vs. NUMA systems.

About 10 years ago non-flat storage systems with an SSD tier were introduced. Access time now varies from block to block and, more interestingly, is completely unpredictable because it depends on each vendor's tiering algorithms. The story had more or less settled, but guess what happened next?

HCI with data locality appeared: the NUMA of storage systems, or should we say NUSA (Non-Uniform Storage Access)? Storage performance now depends on yet another factor: whether the data sits on the local node or has to travel over the network. Our favorite synthetic tests, such as a single IOmeter VM with the usual access patterns, now make little sense. A real production workload is the only reliable way to determine whether a multi-node HCI cluster can handle it. But what if we cannot move production there due to security or cost considerations? Is there any other way?

Let's simulate a real production workload and load the whole HCI cluster. Strike out “100% random across the whole volume”: that test shows nothing except the performance of the lowest tier, and we can easily predict those numbers anyway: 150-300 IOPS per node (2-4 SATA disks).

So what do we need?

  1. At least one workload generator VM per node.
  2. A workload profile similar to production.

For mass workloads such as VDI we have to create a representative number of VMs. A 100% production-equivalent count is ideal, but since most demo systems available to us have only 3-4 nodes, there is no way to fit 3000-4000 VMs there.

I will show you how to create a benchmarking tool for HCI that makes sense. All the following steps and screenshots are for a Nutanix NX-3460G4, because that is the system I had, but you can easily reproduce the same approach on any other system. Moreover, you can test a classic FC SAN the very same way 🙂

I've used CentOS 7 with FIO as the workload generator, with workload profiles borrowed from Nutanix X-Ray 2.2. Why CentOS? The ISO was already on my hard disk, but you can use any other distribution you like.

Now we create several FIO VM templates for different workloads.

1. FIO Management – 1 vCPU, 2GB RAM, 20GB OS
2. DB – 1 vCPU, 2GB RAM, 20GB OS, 2*2 GB Log, 4*28 GB Data
3. VDI – 1 vCPU, 2GB RAM, 20GB OS, 10 GB Data
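
The config files below reference the data disks by device name, so the disk order matters. Assuming the OS disk shows up as /dev/sda and the extra disks enumerate in the order they were added (worth double-checking inside the guest with lsblk), the mapping is:

VDI template: /dev/sdb = 10 GB data disk
DB template: /dev/sdb and /dev/sdc = 2 GB log disks; /dev/sdd through /dev/sdg = 28 GB data disks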

Let's create the FIO management VM with a CentOS minimal install.

Now install FIO:

# yum install wget
# wget http://dl.fedoraproject.org/pub/epel/testing/7/x86_64/Packages/f/fio-3.1-1.el7.x86_64.rpm
# yum install fio-3.1-1.el7.x86_64.rpm

Repeat the same steps for the workload generator templates: a minimal install on the OS disk, and we don't touch the other disks yet. Next, set FIO to start automatically in server mode.

Create the file /etc/systemd/system/fio.service:

[Unit]
Description=FIO server
After=network.target
[Service]
Type=simple
ExecStart=/usr/bin/fio --server
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target

# systemctl daemon-reload
# systemctl enable fio.service
# systemctl start fio.service
# firewall-cmd --zone=public --permanent --add-port=8765/tcp
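
Note that --permanent only updates the saved firewalld configuration; to open port 8765 (fio's default server port) in the running firewall as well, either repeat the command without --permanent or reload:

# firewall-cmd --reload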

The infrastructure is ready; now we need the workload generators.
We create a list of generator IPs:
10.52.8.2 – 10.52.9.146

Excel, as you know, is the most popular IP address management solution in the world.

Upload this list to the FIO management VM. We also upload the FIO config files with the workload descriptions.
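
The host list itself is just a plain text file with one generator IP per line; fio's --client option accepts either a single host or the path to such a file. A fragment of vdi.list:

10.52.8.2
10.52.8.3
10.52.8.4
...
10.52.9.146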

fio-vdi.cfg

[global]
# asynchronous I/O via libaio, bypassing the page cache (O_DIRECT)
ioengine=libaio
direct=1
norandommap
time_based
group_reporting
disk_util=0
continue_on_error=all
# I/O arrivals follow a Poisson process instead of a fixed pace
rate_process=poisson
# run for one hour
runtime=3600
[vdi-read]
filename=/dev/sdb
# 90% 8k and 10% 32k blocks (read,write)
bssplit=8k/90:32k/10,8k/90:32k/10
size=8G
rw=randread
# throttle each VM to ~13 read IOPS
rate_iops=13
iodepth=8
# 80% random, 20% sequential access
percentage_random=80
[vdi-write]
filename=/dev/sdb
bs=32k
size=2G
# write region starts after the 8G read region on the 10 GB data disk
offset=8G
rw=randwrite
# ...and ~10 write IOPS per VM
rate_iops=10
percentage_random=20
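
If you want to sanity-check the profile first, you can run it locally on a single generator VM (it will load only that VM's own disks); stop it with Ctrl-C once you have seen enough:

# fio fio-vdi.cfg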

fio-oltp.cfg

[global]
ioengine=libaio
direct=1
time_based
norandommap
group_reporting
disk_util=0
continue_on_error=all
rate_process=poisson
runtime=10000
[db-oltp1]
bssplit=8k/90:32k/10,8k/90:32k/10
size=28G
filename=/dev/sdd
rw=randrw
iodepth=8
rate_iops=500,500
[db-oltp2]
bssplit=8k/90:32k/10,8k/90:32k/10
size=28G
filename=/dev/sde
rw=randrw
iodepth=8
rate_iops=500,500
[db-oltp3]
bssplit=8k/90:32k/10,8k/90:32k/10
size=28G
filename=/dev/sdf
rw=randrw
iodepth=8
rate_iops=500,500
[db-oltp4]
bssplit=8k/90:32k/10,8k/90:32k/10
size=28G
filename=/dev/sdg
rw=randrw
iodepth=8
rate_iops=500,500
[db-log1]
bs=32k
size=2G
filename=/dev/sdb
rw=randwrite
percentage_random=10
iodepth=1
iodepth_batch=1
rate_iops=100
[db-log2]
bs=32k
size=2G
filename=/dev/sdc
rw=randwrite
percentage_random=10
iodepth=1
iodepth_batch=1
rate_iops=100

Now we prepare the test system for mass VDI deployment. I've created a dedicated IP subnet just for VDI using Acropolis IPAM: AHV intercepts DHCP requests and hands out IPs to the VMs.

As AHV does not hand out IPs strictly in first-to-last order, we simply create an IP pool of the required size: 400 VMs, 100 per host.

Spin off 400 VDI VMs.
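
You can clone them from the template either in the Prism UI or from a CVM command line. A minimal command-line sketch, assuming a template VM named fio-vdi-template (a hypothetical name; exact acli syntax may vary between AOS versions, so check yours):

# for i in $(seq -w 1 400); do acli vm.clone fio-vdi-$i clone_from_vm=fio-vdi-template; done

The zero-padded names (fio-vdi-001 … fio-vdi-400) still match the fio-vdi-* wildcard used below.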


Just creating 400 VMs is already a stress test for the storage: it took under 2 minutes. A good result, I think.

Power VMs on.

SSH to a Nutanix CVM:
# acli vm.on fio-vdi-*

And… FULL THROTTLE!

SSH to the FIO management VM:
# fio --client=vdi.list fio-vdi.cfg

Now your storage system is experiencing 400 average office users. You can easily modify the config files to match your specific case.
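
To put a number on it: with the profile above, each VDI VM is throttled to roughly 13 read plus 10 write IOPS, so 400 VMs generate on the order of 400 × 23 ≈ 9,200 small-block, mostly random IOPS, spread evenly across all four nodes.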

I've also included an “average” OLTP config above. Let's spin up some DB VMs from the DB template, create an oltp.list with their IPs (the same format as vdi.list), and add their workload on top of the VDI load.

# fio --client=oltp.list fio-oltp.cfg

Everything is going great! The system handles the workload well, the numbers look good. Now let's create a disaster, MWAHAHA!

Now we have a perfect opportunity to see how the system handles failures. Pay special attention to “smart” systems that delay rebuilds by an hour or more (“hmm, a host is down, maybe it's not an issue”). They will show you good numbers while the test runs, but in production that behavior can cost you a lot.
Nutanix starts rebuilding data redundancy automatically within 30 seconds, even if it is not a failure but a legitimate operation such as a host reboot during an upgrade.

With such a tool you can easily test any storage/HCI system you are offered. Or, of course, you can just download Nutanix X-Ray, which will test the system for you and provide a nice GUI and a lot of graphs 🙂

What will virtualization look like in 10 years?

Let's define “virtualization” first. Virtualization is abstraction from the hardware itself, from the hardware microarchitecture. So when we talk about virtualization, it's not just about server hypervisors and VMware virtual machines. Software-defined storage, containers, even remote desktops and application streaming: all of these are virtualization technologies.

As of today, mid-2016, server virtualization has almost stalled: no breakthroughs for the last several years. Hypervisors are being honed to perfection, the challengers follow the leader, and the differences shrink every year. So vendors now fight over the ecosystem: management, orchestration, monitoring tools and subsystem integration. We are surprised nowadays when someone wants to buy a physical server and install Windows on it instead of a hypervisor. Virtual machines are no longer IT toys; they are an industry standard. Unfortunately, a sensible defense scheme (from backups to virtualization-aware firewalls) is not yet a standard feature.

Software-Defined Everything, or we could say Virtualized Everything, is growing enormously. Most corporate-level storage systems are almost indistinguishable from standard x86 servers except for the form factor. Vendors no longer use special CPUs or ASICs, putting powerful multicore Xeons in the controllers instead. A storage system today is actually just a standard server with standard hardware, only with a lot of disks and some specialized software. All the storage logic (RAID, replication, journaling) lives in software now. We blur the storage/server border even further with smart cache drivers and server-side flash caches. Where does the server end and the storage begin?

On the other side we see pure software storage systems, never sold as hardware, with no hardware-storage heritage or architectural traits. Take any server, add disks and RAM as you please, install an application and voila! Short on space, performance or availability? Take another server, and another, and maybe a couple more. It gets even more interesting when we install the storage software in a virtual machine or even make it a hypervisor module. There is no longer separate server and storage: this is a unified compute/storage system, which we call hyperconverged infrastructure. Virtual machines run inside it, virtual desktops and servers. More than that, users cannot tell whether they are in a dedicated VM, a terminal server session or just a streamed application. But who cares, when you can connect from a McDonald's halfway across the globe?

Today we talk about containers, but they are not a technological breakthrough; we have known about them for years, especially ISPs and hosting providers. What will happen in the near future is a merger of traditional full virtualization and containers into a single unified hypervisor. Docker and its rivals are not yet ready for production-level corporate workloads; there are still a lot of questions around security and QoS, but I bet it's just a matter of a couple of years. Well, maybe more than a couple, but 10 is more than enough. Where was VMware 10 years ago, and where are we now in terms of server virtualization?

The network control plane is shifting more and more towards software, and access-level switching keeps blurring. Where is your access layer when you have 100 VMs switching inside a hypervisor, never reaching physical ports? The only part really left to specialized hardware is high-speed core switches and ultra-low-latency networks like InfiniBand. But even that is just the data plane; the control plane lives in the cloud.

Everything is moving towards the death of the general-purpose OS as we know it. We don't really need an OS as such; we only need it to run applications, and applications are shifting more and more from installable packages to portable containers. We'll see hypervisor 2.0 as the new general OS, and a further blurring between desktop, laptop, tablet and smartphone. We still install applications, but we already store our data in the cloud. In 10 years a container with an application will move between desktop, smartphone and server infrastructure as easily as we move virtual machines today.

Some years ago we had to park the floppy drive heads when we were finished; today's teenagers live in the cloud; tomorrow's teenagers will have to work hard even to grasp the idea that applications and data were ever tied to particular hardware.

Cloud Trust

When we talk about information security, including cloud security, most of the talk is about confidentiality. In my experience, almost no one talks about the other two parts of the triad: integrity and availability. But these attributes become crucial in the cloud.

Why are we doing cloud in the first place? To cut expenses, both capital and operational: a dollar saved is a dollar earned. Guess what the cloud provider does? The very same thing, cutting expenses as much as it can. And there is no easy answer to the question: make the cloud more secure, or save some money?

Let's take an easy example: how can a cloud provider protect the confidentiality of your data?

For data at rest the answer is pretty obvious: encryption. For data in use (being actively processed) there is no real answer at all; encryption cannot protect you from a privileged insider, since keys and hashes can be sniffed during live migration or through snapshotting. There are no measures that protect your data with 100% assurance, and all of them have costs. With the BIG providers you can be fairly sure there are internal security policies to prevent insider access, and that those who do have access are not random people off the street. As the cloud computing market grows we see a lot of smaller providers with attractive prices, but… So here are some basic questions you would really like your provider to answer before moving your data:

  1. Who has access to the hardware?
  2. How much access do the admins have?
  3. Who is watching them?
  4. Is there an internal backup?
  5. Who has access to the backups?
  6. What really happens to our data when we close the account?

I personally know a small company providing a very good cloud service for accounting and supply management. But they have never deleted any data in their entire history; everything is still in their databases. You closed your account 2 years ago? Doesn't matter, the data is still there.

An important part of the cloud is multitenancy: all tenants use the very same shared hardware infrastructure, which saves money. But it also introduces risks we never saw before the cloud. Questions for the provider:

  1. How are tenants isolated?
  2. Who grants tenant admin rights?
  3. Who is watching them (both admins and tenant admins)?
  4. How are tenant admins authenticated?
  5. What really happens to our data when we close the account?

The last question is exactly the same, but with a different angle: who ensures our data is not accessible, one way or another, to another tenant who takes over the hardware resources we used to occupy?

And this was the easy part, because now we move on to integrity and availability, which most of the time are treated as the operations team's responsibility and get almost no attention from the security team.

Let's say you've rented some VMs from a provider. How do you know where exactly the data is stored and how reliable the storage system is? Is it a high-end EMC Symmetrix, or a 90TB DIY storage box built in someone's garage?

Most providers do not use classic enterprise storage systems with known performance and proven reliability. DIY storage is a way to cut out a really big piece of the investment, but… here are two examples from the Russian provider space:

  • “Selectel” lost customer data several times due to problems with Linux mdraid.
  • “Cloudmouse” irreversibly lost 22,000 VMs due to problems with Ceph.

And I personally wonder: have these guys ever heard of backups? By the way, has your provider?

Okay, I've scared you a little about the cloud, so now let's compare it to good old home-grown IT. We've been building it for years, and we know everything and control everything. Right?
In 98% of the IT shops I've seen: wrong. There are a lot of reasons for that, such as:

  1. There is simply not enough qualified personnel.
  2. The IT manager and the whole IT department try to maintain their personal importance instead of pursuing the company's needs.
  3. Mistakes were made in the past and the company is still paying for them.
  4. Some decisions were purely political rather than technical.
  5. … and this list could be 100 pages long.

So what should we do about it and what’s the magic word?

It is Trust, and in particular Cloud Trust. I've tried to distill the meaning of this word:
– Trust is a situation in which you are confident in the other party's words and deeds.

Outside IT, trust is gained; it is a process. You gain it over time by proving yourself trustworthy. I believe everyone agrees that you should be able to trust your cloud provider if you are moving your data and intellectual property onto their premises.

Experience is something you don’t get until just after you need it.

What we do when we meet new people and need to establish whether we can trust them is call on a trusted third party. You cannot be sure that the man or woman across the table is a real doctor, so you ask for a diploma from a university you trust.
Unfortunately, in the cloud provider space there is no trusted authority that certifies providers. There are several organizations that can help us, though, such as the global Cloud Security Alliance with its ready-to-use questionnaires. You just take one and ask your provider to answer the questions for you.

On the other hand, what I see is that most companies exaggerate the importance of their data, because they don't really have a clue. The Netherlands police, for example, took a deep look into the data they hold. Guess what they found: 95% of everything they have is NOT confidential. How much of a commercial company's data do you think is really confidential?

So what should you do before considering cloud services?

DO

  1. Clean up the mess in your internal IT. Cloud is about automation, and when you automate a mess you get an automated mess.
  2. Classify your data. There is no need for 100 different types and security classes; 3 to 5 will be just fine.
  3. Start with new non-confidential data.
  4. Start with a new test zone in the cloud.
  5. Start with secondary and support processes.
  6. Deploy seasonal and peak loads in the cloud.
  7. Create and test a backup policy with offsite data storage, so that if the cloud goes down you at least have backups.

DO NOT

  1. Replicate your services as they are.
  2. Move everything at once, especially business critical applications.

“IT vs Private Cloud” Paradox

We have been talking about cloud computing for many years, and I have been selling private cloud for a long time. But we are still in the very early stages of private cloud adoption. Why?

The answer was a surprise even for me: a private cloud is not something the IT department needs.

Every commercial company is a manufacturer. Yes, I'm not mistaken: even a small nail salon is a manufacturer. They produce profit. For the sake of simplicity, let's define profit as income minus costs (capital expenses and operational expenses, including salaries). As we know, a dollar saved is a dollar earned, and therefore we are driving costs down.
But where does the cloud come in, you ask? Just wait for it.

Let's take a look at the employees who are allegedly most interested in the cloud: the IT department. The department includes IT management and administrators/specialists, plus IT assets in both hardware and software. And a budget. As a rule, the IT budget looks like some kind of financial black hole, actively consuming sums with many zeroes. It is almost impossible to understand the financial flows and how they map onto actual IT services. Here comes the private cloud with financial visibility, service catalogs and measured service, so we can actually say how much one mailbox costs. We are in the CFO's dream now.

But the IT department says: NO!
RLY? WTF?

Ok, let's take another look at the IT department, from an angle completely unrelated to technology: motivation.

What does the average IT admin want? A pretty simple answer: high-tech toys, arcane techno-mage status and significance. Who should choose the new servers or storage system? Of course ME, it's MINE! No, it's not. It's a tool, not a toy, and the cloud brings standardization. More than that, the cloud makes admins interchangeable; the role no longer carries any arcane knowledge. A cloud admin is highly qualified in several areas, yes, but I don't see many admins over 30 who really want to study something new and adapt. People want stability and an “expert” title. What they do not want is to remain students until they have grandchildren.

What does IT management want, if we skip the part about kickbacks and gray procurement schemes? Pretty much the same: influence and significance, which directly translate into headcount and total system cost. Plus a budget they control themselves, with no one looking over their shoulder. Every new direct report adds cost, and every new admin adds another NO to the cloud question.

What does the cloud do to the IT budget? The black hole splits into separate services with measured costs, and the CFO can now compare internal services with what is available on the open market, a comparison that may not be in the internal services' favor. The cloud brings financial visibility to financial management and line-of-business managers, and shows how the budget is spent relative to company targets.

– What, the board will be able to see how I spend my budget?! – a direct quote from one CIO I met.

So it's not really a paradox; we now understand why IT doesn't like the cloud. But what should we do about it? I don't have that answer.

Insider threat for Cloud. Some thoughts.

As we move towards 100% virtualization, the role of the vAdministrator becomes more and more important. Unlike years before, a vAdmin can rule the entire infrastructure from a single console. One of the top-3 US banks could be brought down completely by a single script, imagine that!
We see more and more cases where a fired admin logs in to the ex-employer's infrastructure over McDonald's WiFi and deletes some critical data.
Take the Code Spaces example: the hackers wanted a lot of money and didn't get it, so they simply deleted everything, including the backups.

The only thing growing faster than IT security spending is the cost of security breaches. That's the reality we see today.

Without question, the level of control will keep increasing, as will the pressure on privileged users and admins. But what really surprised me is that the four security pros on stage (SEC2296, VMworld 2014) said nothing about the organizational side of this insider-security nightmare.

Let's think about it a little. An insider is a person inside the company, an employee most of the time. We can divide them into three basic categories:

  1. People who will do something bad and sell the company's secrets no matter what.
  2. People who may do something bad, or may do nothing.
  3. Angels. They will do nothing bad even if management does something bad to them.

Type 1 insiders should be discovered ASAP, ideally at the interview stage; that is why HR professionals are involved and background checks are performed.
Type 3 insiders are not a threat.

That leaves the type 2 people, and that's the type we ignore: the majority of employees in any company. These people will do something bad only as retaliation; they will not strike first. And guess what we are doing to them?
– we put them under suspicion and constant surveillance
– we treat all their activity as if they were type 1 people
– we completely ignore their personality, treating them as replaceable and expendable working units.

I can assure you: nothing provokes action against an employer more than this kind of treatment, especially when the person being treated this way has access to the most critical services.

There are NO statistics on the percentage of incidents caused by bad management treating employees like trash. Yet we try to solve an organizational problem technically, without any human interaction. Is this because we are techno geeks lacking social skills, or just because it is harder and more complex than putting webcams everywhere, including the restrooms?
At some point there can be ONLY trust. Imagine you are on the operating table: how can you enforce security and be sure the surgeon performs only permitted actions? There is no way, period. We give the surgeon very high rights, and we (society) also assign very high responsibility.
A virtualization administrator with the highest level of access is that very same surgeon operating on the organization's IT heart, sometimes while the heart still beats. So why do we watch what the admin is doing, and not how the manager treats him or her?

So, after years of experience and thought, I see two basic rules of information security when we talk about these type 2 guys and gals with full access:

  1. The insider threat becomes VERY real when you treat your employees and colleagues as insiders and threats instead of people who help, when you see them as easily replaceable and expendable working units.
  2. An employee's loyalty to the company starts with the company's loyalty to the employee.

We should solve the organizational and administrative problems first; otherwise technical solutions will be useless, or will even lower overall security.